Editor's Introduction 




Richard W. Beane 

Editor 



The Digital Technical Journal bridges a gap 
in the information published about Digital's 
products by providing an explanation of 
their technological foundations. In the past, 
such explanations appeared in papers written 
by Digital's engineers for various periodicals. 
Unfortunately, anyone wanting concise 
technical details had to search through the 
gamut of this literature. 

This journal was created to present that 
information in one publication. The papers 
are written by the engineers who developed 
the products, in terms of the technologies 
that went into their designs. Our audience is 
composed of engineers within Digital, as 
well as engineering educators and 
customers. 

This issue, our inaugural one, features the 
VAX 8600 processor. Its design, with a pipe- 
lined architecture and emitter-coupled 
logic, offers many innovations besides 
increased speed. New packaging, cooling, 
and reliability techniques, and new auto- 
mated simulation tools were used to develop 
this product. Some papers explain the final 
results of the development process, others 
discuss the process itself. All give the reader 
a sense of the unique ways in which Digital 
develops its products. 

The first paper, by Tryggve Fossum, Jim 
McElroy, and Bill English, is an overview of 
the 8600's salient features. The distribution 
of processing into the various "boxes," the 
specific performance improvements, and the 
reliability and data integrity features are 
amply discussed. This paper establishes a 
framework to assist the reader in fitting the 
more detailed papers into an overall context. 



The pipeline paper, by Mario Troiani, 
Steve Ching, Nii Quaynor, John Bloem, and 
Fernando Colon Osorio, explains the 
VAX 8600 pipeline in terms of a general 
model. This starting point is important in 
understanding the unique contribution of 
the pipeline's key element, the instruction 
prefetch unit. The paper explains how this 
unit fetches instructions, achieves control, 
and maintains data integrity. 

The 8600 features fast, efficient floating 
point operations. The paper by Tryggve Fos- 
sum, Bill Grundmann, and Ginny Blaha dis- 
cusses the instruction flow in the floating 
point accelerator and the role of emitter-cou- 
pled logic in its design. The authors describe 
how algorithms are processed and how 
microcode controls those operations. 

The next three topics are closely related 
because decisions in packaging, signaling, 
and cooling must be made with their inter- 
acting effects in mind. The paper on packag- 
ing, by Jim McElroy, discusses the evolution 
of the process that identified the best way to 
package the modules and components. John 
Hackenberg's paper on signal integrity 
describes the software tools that enabled the 
design team to distribute power while con- 
trolling noise and avoiding cooling 
problems. The solutions to those problems, 
including the use of thermal design rules and 
special measuring techniques, are discussed 
by Brian Kalita and Bill English. 

The final paper, by Bill Bruckert and Ron 
Josephson, explains why reliability consists 
of the avoidance, tolerance, and minimization 
of faults, and the improvement of MTTR. 
The authors discuss the techniques used to 
reduce failures, to identify those that do 
occur, and to make repairs easier. 

These papers represent a cross section of 
the activities in a large design project, and 
they relate the results of design decisions as 
well as the process for making them. 

How appropriate it is that this first issue of the 
Digital Technical Journal, a medium for com- 
municating new technical ideas and results 
within Digital, should be dedicated to the 
VAX 8600 system. The 8600 represents the con- 
fluence of many new concepts and much good 
engineering in the areas of implementation 
architecture, interconnect, packaging, cooling, 
design methodology and tools, CPU and systems 
design verification, and complexity management. 

The VAX 8600, or VENUS, program 
approached the problem of producing a high- 
performance VAX system in two ways. First, we 
reduced the cycle time by physical means. Sec- 
ond, by incorporating new design techniques, 
we reduced the average number of cycles 
required to implement instructions over a wide 
range of typical uses. The performance range of 
the 8600 makes it appropriate for customers 
with requirements close to those provided by 
mainframes. Therefore, we had to address main- 
frame reliability, maintainability, and lifetime 
cost-of-ownership issues from the beginning of 
the project. For this reason several new con- 
cepts had to be integrated into the design. 

The key concept of the new physical technology 
incorporated in the 8600 is the use of ECL 
gate arrays called macrocell arrays, developed 
jointly by Digital in Marlboro and Motorola in 
Phoenix. In order to deal with the speed of 
ECL, we had to pay special attention to board, 
connector, and backplane impedance and delay, 
as well as manufacturing problems. Incorporating 
ECL yielded a cycle time of 80 
nanoseconds. Compared to 200 nanoseconds 
on the VAX-11/780 system, that represents a 
performance gain of 2.5, which is the minimum 
gain without architectural improvements. 
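
The arithmetic behind that figure can be checked with a short sketch. Only the two cycle times come from the text; the function names are illustrative, and the calculation assumes both machines need the same number of cycles per instruction.

```python
# Cycle times quoted in the text, in nanoseconds.
CYCLE_NS_8600 = 80
CYCLE_NS_780 = 200


def clock_mhz(cycle_ns):
    """Convert a cycle time in nanoseconds to a clock rate in MHz."""
    return 1000.0 / cycle_ns


def cycle_time_gain(old_ns, new_ns):
    """Speedup from cycle time alone, assuming equal cycles per instruction."""
    return old_ns / new_ns


if __name__ == "__main__":
    print(f"11/780 clock: {clock_mhz(CYCLE_NS_780):.1f} MHz")
    print(f"8600 clock:   {clock_mhz(CYCLE_NS_8600):.1f} MHz")
    print(f"gain from cycle time alone: {cycle_time_gain(CYCLE_NS_780, CYCLE_NS_8600)}")
```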

The architectural challenge in this implementation 
was to increase the VAX 8600 performance 
by 1.5 to 2.5 times that of the 11/780 
by executing more of the functions of each 
instruction during every cycle. Meeting this 
challenge required that the operations of 
instruction decoding and execution take place 
in parallel to a greater degree than in any previ- 
ous VAX implementation. Thus the concept of 
pipelining became a necessity in the VAX 8600 
implementation. Moreover, the higher speeds 
required different approaches to cache 
management, memory busing and management, 
and I/O. In particular, the concept of a 
"writeback" cache was introduced to reduce 
the number of times that individual accesses to 
slower main memory are needed. Furthermore, 
the memory and I/O buses were separated to 
allow higher memory bandwidth, which 
decreases the amount of needed memory, and 
to avoid I/O interference problems. 

The resulting design, which has from 100 to 
200 thousand gates (depending on how gates 
are counted), introduced new levels of com- 
plexity in both design and management that 
stretched us all into new domains of knowledge 
and maturity. For example, we discovered quite 
early in the program that our classic design 
approach of quickly designing on paper, build- 
ing prototypes, and debugging them would 
NOT work. The design turnaround times for the 
chips alone would have gotten us to market 
much too late to be competitive. Thus we 
began the process of simulating, debugging and 
verifying the 8600 by using other computers 
instead of moving wires. That process required 
us to develop new tools for timing analyses, 
such as AUTODLY, and new methods for building 
data bases. Moreover, new techniques had 
to be devised for finding and fixing problems 
by using tools and libraries instead of real 
design bugs. And, of course, computer 
resources had to be identified, ordered, and 
installed. 

Initially we had planned to use four KL-10 
systems and a VAX-11/780 system as the computer 
resources needed for the whole program. 
We soon found that more machines were 
needed quickly if we were to succeed with sim- 
ulation. In the course of the next two years we 
installed about one new system per month, ending 
with not only twelve KL-10s but also twelve 
11/780s. Simulation was a tremendous challenge 
to the whole organization and required 
close cooperation from our partners in other 
groups, especially manufacturing and CSSE. 
The former helped us to get equipment and 
loaned us space, and the latter moved their own 
work around and loaned us systems and people 
to complete the simulation and verification. 
Moreover, networking at a much higher level 
was then needed and communications between 
the 11/780s and the KL-10 systems had to be 
improved. Our Site Resources and Engineering 
Group had to accommodate these changes, and 
their capabilities grew continually within the 
available constraints of time and space. 

Finally, a word about the management of the 
VAX 8600 program. First, I believe we learned 
a great deal about the management methodolo- 
gies required to produce a product as complex 
as the VAX 8600 system. Our fundamental phi- 
losophy was open communications at all levels 
of the project. We fostered the attitude that 
finding problems, discussing them, and asking 
for help were signs of intelligence and matur- 
ity, not ones of weakness or failure. To succeed, 
we knew this was the "right thing to do." We 
also developed a review process that 
encouraged project members and other groups 
to see our progress. This process included regular, 
open reviews for all project levels, weekly 
program reviews for all groups involved in the 
project (manufacturing, CSSE, VMS, semiconductors, 
purchasing, etc.), and monthly 
reviews for people throughout the company 
who were less directly involved. 

During the course of the program these 
reviews allowed the development of close pro- 
fessional and personal relationships that clearly 
helped us to meet the VAX 8600 program per- 
formance, function, cost, quality, volume and 
schedule goals. 

The following papers represent a cross sec- 
tion of the problems addressed, solutions 
found, and successes achieved in the course of 
developing the 8600. Many topics could have 
been included, but this group should provide 
the reader with some insight into the product 
design and management processes associated 
with this program. 

At this time I would like to acknowledge formally 
all the people not only within the High 
Performance Systems and Clusters Group, but 
especially those outside this group who con- 
tributed so creatively and generously to this 
program. Naming each of them would surely 
consume the remaining pages of this journal 
since there were over 40 different facilities, 
and at least that many groups, involved in the 
VENUS program. The success of the VAX 8600 
system is their success! 

An Overview of the 
VAX 8600 System 

The VAX 8600 system handles 5 million Whetstones per second, which is 
over four times faster than the VAX-11/780 system. The 8600 uses pipelined 
instructions, a bigger cache memory, and a dedicated memory bus to 
achieve its speed. Inside, small processors, called boxes, perform tasks 
simultaneously. The I Box prefetches instructions while the E Box executes 
others; the F Box performs fast floating point operations, as do all VAX 
systems. Macrocell array technology, with fast gate speeds, and microcode 
control are used throughout. These aspects, plus a new cooling system and 
interconnect innovations, make the VAX 8600 system very reliable. 



The main design objective of the VAX 8600 
project was to gain a significant improvement 
in VAX computing performance with a minimal 
cost increase. Furthermore, the 8600 had to 
retain all the characteristics common to the 
32-bit VAX Family. These characteristics 
included the following requirements: the new 
machine must run the VMS operating system, 
must interconnect to the present I/O bus struc- 
tures, and must have the network links associ- 
ated with the VAX computing environment. 
Improved performance is achieved through 
innovations in computer design and the intro- 
duction of large scale processing concepts 
into the VAX architecture. Innovations include 
the use of ECL macrocell arrays (MCAs) 
throughout the CPU and new electrical and 
mechanical packaging. Among the large scale 
processing concepts employed are a dedicated 
memory bus and pipelined operation in both 
instruction processing and memory references. 

Designing a large scale computer is a process 
driven by Digital's performance goals for the 
machine. On some projects, little time remains 
to evaluate the relative costs of equivalent 
alternatives. All VAX systems, however, must 
meet price/performance design criteria, the 
most important of which is the customer's 
overall cost of ownership. Therefore, to meet 
those criteria, we used many techniques to 
enhance the system's reliability, availability, 
and maintainability. 

The VAX 8600 System 

The VAX 8600 processor (Figure 1) consists of 
six relatively independent subprocessors: 
E Box, F Box, I Box, M Box, console, and I/O 
adapter. The E Box executes the VAX instruc- 
tion set and generally directs the entire system. 
The I Box prefetches instructions and operands 
and decodes them for later execution by the 
E Box. This gives the machine a pipelined struc- 
ture: several instructions can be present in the 
I Box and the E Box at the same time. The pipe- 
line enables some frequently executed instruc- 
tions to be completed in the E Box in a single 
machine cycle of 80 nanoseconds. 

The M Box contains a 16-kilobyte data cache 
to increase the speed of memory access. It also 
contains a buffer that holds recently used trans- 
lations of virtual memory addresses to physical 
ones. Using a translation buffer eliminates the 
need to look up these addresses for every mem- 
ory access. The M Box interfaces the memory to 
all other parts of the system, and also interfaces 
the E, F, and I Boxes to the adapter bus for 
input and output. A "memory reference" by 
one of the other boxes happens in a "cache 
cycle," the objective of the design being to deal 
solely with the high-speed cache as often as 
possible. The M Box actually references storage 
only when needed data is not found in the 
cache or when room needs to be made in it for 
new data. As elsewhere in the machine, the 
M Box has a variety of reliability and maintain- 
ability features, including error correction on 
the data in the cache. 

The F Box is a floating point processor or 
accelerator (FPA). When present in a system, 
the F Box intercepts floating point instructions 
as they are presented by the I Box. Special hard- 
ware for fast unpacking, aligning, adding, mul- 
tiplying and dividing produces extra-high per- 
formance for scientific, computational number 
crunching. 

The 16 general purpose registers (GPRs) are 
basic to the accurate and fast manipulation 
of data. The I and F Boxes each have one set 
and the E Box has two; altogether, four copies 
of the GPRs are kept to guarantee fast, flexible 
access and instruction retry. 

The console is a microprocessor-based front 
end interface to the operator, the boot device, 
and the remote diagnostics. This unit is used to 
initialize the system on power-up, to test it, and 
to assist in isolating faults. The console also 
automatically handles various functions that are 
usually performed manually by an operator. 



The I/O system is based on Digital's standard 
synchronous backplane interconnect (SBI), 
which is interfaced to the rest of the system via 
the M Box through an adapter on the 
adapter bus. The various device controllers and 
adapters to other interconnects are located on 
the SBI. 

Although all boxes contain microcode, the 
main microcode is in the E Box. This allows the 
boxes to perform complex functions with a 
small amount of hardware, providing design 
flexibility and a good cost/performance ratio. 
All microcode storage is writable, which facili- 
tates changes and additions whenever neces- 
sary. Initially, the RAMs are loaded from 
microcode files stored on a removable disk in 
the console subsystem. Microcoded diagnostic 
programs are also loaded in the control store 
when it is necessary to identify failing 
components. 

A number of buses interconnect the various 
boxes. All data movement between the proces- 
sor and both the memory array and the I/O sub- 
system occurs through the memory data bus 
connecting the M Box and the I Box. The I Box 
receives the instruction stream and the memory 
operands over this bus; the memory operands 
are then passed to the E Box and the F Box over 
the operand bus. Results from either of those 
boxes are sent via the write bus to the I Box, 
which in turn passes them to the M Box over 
the memory data bus. The write bus is also used 
to keep the four sets of GPRs identical to one 
another. Both the I Box and the E Box supply 
addresses (almost always virtual) to the M Box. 
All buses and registers handle 32-bit words. 

The component technology used in the 8600 
is the macrocell array, which provides a typical 
gate speed of one nanosecond and has high- 
density LSI ECL technology in a 68-pin package 
that is one inch square. MCA technology is an 
extension of the gate array concept. Instead of 
gates, however, each cell in the array contains a 
number of unconnected transistors and resis- 
tors. By creating interconnecting patterns, one 
can transform those components into small- 
scale/medium-scale integration (SSI/MSI) logic 
functions or "macros." These macros take the 
form of standard logic elements such as dual 
D-type flipflops, dual full adders, quad latches, 
and the like. Most of them are series-gated ECL 
structures for optimized performance. 

E Box, Heart of the System 
The E Box, the focal point of the entire system, 
executes the VAX instruction set, handles 
exceptions and interrupts, and controls the rest 
of the system. It is highly microcoded: most of 
its elements are directly controlled in each 
cycle by bits in the microword. Intensive 
microcoding makes possible the use of a 
datapath with a simple structure; the power of 
the datapath comes from the speed and ease 
with which it can be manipulated by the 
microcode. 

As shown in Figure 2, the E Box contains a 
dual-ported scratchpad memory (Register Files 
A and B) comprising 256 32-bit registers. In the 
scratchpad are basic machine registers, copies 
of the GPRs, about 150 constants and 
microcode temporaries, and some architectur- 
ally defined registers used by memory manage- 
ment and the operating system. 

Arithmetic and logical operations are done by 
a 32-bit arithmetic and logic unit (ALU), which 
has all the usual functions for performing add, 
subtract, OR, exclusive OR, and similar opera- 
tions. There are also some special ALU func- 
tions for speeding division, decimal arithmetic, 
and comparisons. The most significant perform- 
ance factor related to the ALU, however, is the 
ability of the microcode to take any two values 
from the scratchpad, operate on them in the 
ALU, and store the result back in the 
scratchpad, all in a single cycle. With this capability, 
some whole instructions can be completed 
in just one cycle. And longer, repetitive 
instructions, such as those handling character 
strings, can be executed in short loops. 

Paralleling the ALU is a barrel-shift network 
that accepts a 64-bit value, joins it end to end, 
and selects any desired 32 consecutive bits 
from the ring format. The value can be supplied 
by two scratchpad registers or one register con- 
catenated with memory data. Control over the 
shifter can be exercised directly by a field in 
the microword, or through a shift control register. 
The register allows a new shift count to be 
specified relative to a previous one. The 
shifter is used for unpacking and packing float- 
ing point data, translating different decimal 
data formats, arithmetic shifts and rotations, 
and various other bit manipulations. As in the 
case of the ALU, the shifter's power is enhanced 
by the ability of the microcode to take any two 
words in the scratchpad, shift them, and store 
the result back in the scratchpad, all within the 
same cycle. 
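
The selection rule described above can be sketched in a few lines. This is an illustrative model only: the function name, the bit numbering (bit 0 is the least significant bit of the low word), and the way the two 32-bit inputs are joined are assumptions, not details given in the text.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def barrel_select(high, low, start):
    """Return 32 consecutive bits of the 64-bit ring high:low, beginning at bit `start`.

    Bit 0 is the least significant bit of `low`; positions wrap modulo 64.
    """
    value = ((high & MASK32) << 32) | (low & MASK32)
    start %= 64
    # Rotate right by `start` so the wanted field lands in the low 32 bits.
    rotated = ((value >> start) | (value << (64 - start))) & MASK64
    return rotated & MASK32


def rotate_right32(x, n):
    """Rotation: feed the same word to both halves of the ring."""
    return barrel_select(x, x, n)


def shift_left32(x, n):
    """Logical left shift by 0..32 places: zeros enter from the low half."""
    return barrel_select(x, 0, 32 - n)
```

Under these assumptions, rotations and shifts become different choices of the two input words and the starting position, which is one way a single network can serve the uses listed above.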

I Box Handles the Details 
The VAX architecture has a rich instruction set 
with a large number of opcodes and specifiers 
for fetching operands and storing results. While 
this variety is quite useful to the programmer 
and compiler writer, the task of decoding these 
opcodes and specifiers constitutes much of the 
total work in processing VAX instructions. 
Therefore, the 8600 has a separate subsystem 
dedicated to prefetching instructions, decoding 
them, fetching source operands, and storing 
results. That subsystem also receives condition 
codes from the E Box and makes all branch target 
fetches and decisions. Much of the time, 
this work is overlapped with the actual instruction 
execution in the E and F Boxes, thus 
achieving a high degree of simultaneous 
processing. 

The I Box consists of two major parts: an 
instruction unit and an operand unit (Figure 3). 
The instruction unit contains an 8-byte FIFO 
instruction buffer, which receives instruction- 
stream data from memory, 4 bytes at a time. The 
unit evaluates these bytes to determine the 
addressing mode and to make instruction 
optimization decisions. Evaluation is done with 
the help of a decode RAM, which contains 
information specific to the individual opcodes 
and specifiers. 

The instruction unit also supplies informa- 
tion about where to find the operands for an 
instruction. Using this information, the operand 
unit can generate the addresses for the oper- 
ands and start the memory reads to fetch them. 
For this purpose, the unit has its own copy of 
the GPRs, since they are needed to calculate the 
addresses. Often the GPRs contain the oper- 
ands, in which case either they are read directly 
or the numbers of the GPRs containing them are 
passed to the execution units (E and F Boxes). 
At other times, the operands are contained in 
the instruction stream itself, in which case they 
are extracted from the instruction buffer. 
Whenever possible, the instruction unit tries to 
process two specifiers in a single cycle by han- 
dling the second specifier as a GPR number. 
This optimization saves valuable cycles in fre- 
quently used instructions. 

When the E Box is ready, the I Box supplies 
the operands to it along with a dispatch address 
identifying the start of the microcode appropri- 
ate to the execution of the instruction. When 
execution is complete, the operand unit will 
provide the address for storing the result in 
memory. 

Therefore, the overall sequence of steps in 
performing an instruction is fetch instruction, 
decode instruction, generate address, fetch 
operand, execute, and store result. Any one of 
these steps for a given instruction may occur 
simultaneously with any other step for some 
other instruction. Of course, this is limited by 
the obvious restriction that no two operations 
can use the same resource (memory, register 
file, etc.) simultaneously. Thus, for example, 
while the I Box is decoding instruction 4, it 
may also be calculating addresses for instruc- 
tion 3 and fetching the operands for instruc- 
tion 2. If the operands are in the GPRs, then the 
current cache cycle may be used for fetching 
more of the instruction stream (say parts of 
instructions 6 and 7, with 5 already in the 
buffer). Moreover, any of these steps may be 
happening while the E Box is executing instruction 1. 
This overlapped processing, called 
"pipelining," greatly improves performance 
and is detailed later in the Pipelined Instruc- 
tion Processing section. 

Of course, there are bound to be hazards 
whenever work is done in parallel. The pipe- 
line cannot always operate at full speed due to 
conflicts produced by the various subsystems 
needing the same resources. Since several 
stages may be active simultaneously, the con- 
trol of each stage is intimately tied to the past 
and present operations in the other stages, as 
well as to those in the E Box and the M Box. 
Each stage attempts to process the available 
input data as quickly as possible. Whenever 
input is unavailable or a result cannot be stored 
immediately, a stage is said to be "stalled." One 
objective of the I Box, and of a pipelined struc- 
ture in general, is to minimize the time any 
stage spends in a stalled state as it can perform 
no useful work during that time. The execution 
unit will sometimes store a result in a register 
that is needed by the operand unit for the next 
instruction. A problem of this sort is resolved 
by using scoreboards and conflict detectors. In 
many cases, conflicts are avoided by passing the 
data as GPR tags, rather than passing the actual 
data. Fortunately, the VAX architecture nor- 
mally precludes writing into the instruction 
stream, so the instruction buffer can prefetch 
freely across most instructions. 
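
A scoreboard of the kind mentioned above can be pictured as a table of registers with writes still outstanding. The text does not describe the 8600's scoreboard logic, so the class and method names below are hypothetical and the sketch shows only the general idea.

```python
class Scoreboard:
    """Tracks GPRs that have a write pending from the execution unit."""

    def __init__(self, n_regs=16):
        # Count of outstanding writes for each register.
        self.pending = [0] * n_regs

    def reserve(self, reg):
        """Record that an instruction in flight will write `reg`."""
        self.pending[reg] += 1

    def release(self, reg):
        """Record that a pending write to `reg` has completed."""
        if self.pending[reg] == 0:
            raise ValueError(f"no pending write for register {reg}")
        self.pending[reg] -= 1

    def must_stall(self, source_regs):
        """True if any register needed for address generation is still pending."""
        return any(self.pending[r] for r in source_regs)


if __name__ == "__main__":
    sb = Scoreboard()
    sb.reserve(3)                    # instruction 1 will write R3
    print(sb.must_stall([3]))        # instruction 2 needs R3 as a base register
    sb.release(3)                    # the result has been written
    print(sb.must_stall([3]))
```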

When appropriate, the I Box supplies all 
operands sign-extended and all floating point 
operands in memory format, independently of 
the source of the data. Therefore, the E Box and 
the F Box do not need to perform any special 
data manipulations before the data is used. In 
keeping with the principle of a high-speed, yet 
economical implementation, the VAX 8600 sys- 
tem uses the instruction buffer to fetch data for 
string and other multiple-operand instructions, 
thus using hardware that would otherwise sit 
idle. This procedure expedites large amounts of 
data through the processor without wasting 
cache cycles. This feature is especially 
important in commercial applications where 
data manipulation is more important than arith- 
metic speed. 

Since the 8600 is designed to run with the 
VMS operating system, the processor must be 
prepared to deal with memory exceptions dur- 
ing instruction execution. This procedure is 
complicated by multiple instructions being in 
the pipeline at the same time. For sorting things 
out, the operand unit has multiple program reg- 
isters that contain the starting addresses of all 
instructions in progress. A register log keeps 
track of GPR changes that must be undone 
should an instruction have to be repeated. 
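
The role of the register log can be illustrated with a small undo list. This is a sketch of the concept, not of the 8600 hardware: the class name, its methods, and the choice to record (register, old value) pairs are assumptions.

```python
class RegisterLog:
    """Records GPR changes so that a partly processed instruction can be undone."""

    def __init__(self, gprs):
        self.gprs = gprs
        self.log = []  # (register, old value) pairs, oldest first

    def write(self, reg, value):
        """Change a GPR, remembering the value it held before."""
        self.log.append((reg, self.gprs[reg]))
        self.gprs[reg] = value

    def commit(self):
        """The instruction completed; its changes become permanent."""
        self.log.clear()

    def undo(self):
        """Restore the GPRs in reverse order so the instruction can be repeated."""
        while self.log:
            reg, old = self.log.pop()
            self.gprs[reg] = old


if __name__ == "__main__":
    gprs = [0] * 16
    gprs[2] = 0x1000
    log = RegisterLog(gprs)
    log.write(2, gprs[2] + 4)   # e.g. an autoincrement specifier advances R2
    log.undo()                  # a memory exception forces a retry
    print(hex(gprs[2]))
```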

M Box and Memory 

The memory system includes the storage array 
boards and the M Box. This box contains not 
only all of the control, transfer, and error logic 
for the storage array, but also a data cache for 
fast access to memory data (Figure 4). Each 
array board contains 4 megabytes of MOS stor- 
age, and the memory backplane can hold eight 
boards for a maximum of 32 megabytes. The 
basic storage unit is a block of four 39-bit 
words, each with 4 data bytes and a 7-bit error- 
correction code. Special logic is included for 
writing bytes, significantly decreasing the storage 
access requirements. The M Box interfaces 
to and handles communication among the three 
major parts of the system: the main memory, the 
processor, and the I/O system (via the 
adapter bus). 
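
The capacity and word-width figures above can be checked with a little arithmetic. The text does not name the error-correction code; the sketch below assumes a Hamming code with single-error correction and double-error detection (SEC-DED), for which seven check bits per 32-bit word is the usual count. The function names are illustrative.

```python
def sec_check_bits(data_bits):
    """Smallest r with 2**r >= data_bits + r + 1 (Hamming single-error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits):
    """SEC bits plus one overall parity bit for double-error detection."""
    return sec_check_bits(data_bits) + 1


if __name__ == "__main__":
    data_bits = 4 * 8                      # 4 data bytes per word
    check_bits = secded_check_bits(data_bits)
    print(f"check bits per word: {check_bits}")
    print(f"stored word width:   {data_bits + check_bits} bits")
    print(f"maximum memory:      {8 * 4} megabytes (8 boards of 4 megabytes)")
```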

The cache is a high-speed memory with loca- 
tions that act as temporary substitutes for a 
selection of the most frequently used storage 
locations. The cache is two-way associative, 
meaning that for each address, the data can be 
stored in either of two locations. The total 
cache size is 16KB in two 8KB parts; its loca- 
tions are allocated in blocks of four words 
(16 bytes), addressed on a four-word boundary. 
In addition to the two data parts, there is a 
cache tag store containing the address bits for 
the blocks of data in the cache data store. For 
each block, the tag store also contains a valid 
bit and four written bits for the four words in 
the block. Associated with the data to ensure its 
integrity is an error code that enables the cor- 
rection of single-bit errors and the detection of 
double errors. 
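
The cache geometry follows from the sizes given above. The sketch below derives the number of sets and splits a physical address into tag, set index, and byte offset. Which address bits the 8600 actually uses for each field is not stated in the text, so the conventional low-order split is an assumption.

```python
CACHE_BYTES = 16 * 1024   # total cache size
WAYS = 2                  # two-way associative
BLOCK_BYTES = 16          # four 32-bit words per block

SETS = CACHE_BYTES // (WAYS * BLOCK_BYTES)
OFFSET_BITS = BLOCK_BYTES.bit_length() - 1
INDEX_BITS = SETS.bit_length() - 1


def split_address(paddr):
    """Split a physical address into (tag, set index, byte offset)."""
    offset = paddr & (BLOCK_BYTES - 1)
    index = (paddr >> OFFSET_BITS) & (SETS - 1)
    tag = paddr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


if __name__ == "__main__":
    print(f"sets: {SETS}, offset bits: {OFFSET_BITS}, index bits: {INDEX_BITS}")
    print(split_address(0x12345))
```

With these sizes there are 512 sets of two blocks each, so any given address has exactly two candidate locations in the cache.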

The cache uses a writeback scheme for writ- 
ing in memory. This means that a word is not 
written in storage when it is modified, but only 
when its cache location is needed for other 
data. In the interim, data is placed only in the 
cache, so a single cache location may be used 
many times without requiring access to the 
memory array (whose corresponding location 
becomes invalid). The contents of the cache are 
finally written in the array only when that 
cache location is needed to represent a differ- 
ent storage location. The replacement policy is 
"least recently used." That is, of the two blocks 
available to store a given piece of data, the one 
less recently accessed receives the new data. 
When a memory word containing a corrected 
error is placed in the cache, the written bit is 
turned on to force eventual rewrite of the stor- 
age location, thus reducing the probability of a 
double error. 
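
The effect of the writeback scheme on storage traffic can be seen in a small simulation. This sketch simplifies the design described above: it keeps one written flag per block where the text describes one per word, and it assumes that a miss always fetches the block from storage. Class and attribute names are illustrative.

```python
class WriteBackCache:
    """Two-way set-associative write-back cache with LRU replacement."""

    def __init__(self, sets=512, block_bytes=16):
        self.sets = sets
        self.block_bytes = block_bytes
        # Each set holds up to two entries, least recently used first.
        # An entry is [block_number, written_flag].
        self.lines = [[] for _ in range(sets)]
        self.memory_reads = 0
        self.memory_writes = 0

    def access(self, paddr, write=False):
        """Perform one read or write reference at physical address `paddr`."""
        block = paddr // self.block_bytes
        ways = self.lines[block % self.sets]
        for entry in ways:
            if entry[0] == block:            # hit: no storage access needed
                ways.remove(entry)
                break
        else:                                # miss
            if len(ways) == 2:
                victim = ways.pop(0)         # least recently used block
                if victim[1]:
                    self.memory_writes += 1  # write modified data back
            entry = [block, False]
            self.memory_reads += 1           # fetch the block from storage
        entry[1] = entry[1] or write
        ways.append(entry)                   # now the most recently used


if __name__ == "__main__":
    cache = WriteBackCache()
    for _ in range(100):
        cache.access(0x1000, write=True)
    print(cache.memory_reads, cache.memory_writes)
```

In this model, a hundred writes to one location cost a single storage read and no storage writes until the block is displaced.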



Addresses actually supplied to the cache or 
the memory array are always physical, and the 
direct memory access (DMA) references made 
by the I/O system always use physical 
addresses. There are three sources of memory 
references within the processor, each having its 
own port into memory: the instruction buffer, 
the operand unit, and the execution unit. Nor- 
mally these references are virtual, meaning that 
the addresses have to be translated from virtual 
to physical before they can be used to access 
the cache. When a virtual reference is made, 
the M Box microcode uses the high-order part 
of the address to index into the translation 
buffer (TB), itself a cache containing the most 
recently used translations. The entry from this 
buffer is then prefixed to the remaining bits of 
the virtual address to form the desired physical 
address. The TB is one-way associative and has a 
capacity of 512 paging entries. Besides transla- 
tion information, it contains access-protection 
data, which aids in creating a secure operating 
environment. Refilling entries in the buffer is 
done from page tables in memory. 
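
The lookup just described can be sketched as a direct-mapped table. The 512-byte page size is a property of the VAX architecture and is not stated in this text. The choice of index bits, the tag comparison, and all names below are assumptions made for illustration, and access-protection checks are omitted.

```python
PAGE_BITS = 9        # VAX pages are 512 bytes
TB_ENTRIES = 512     # one-way associative: one candidate entry per address


class TranslationBuffer:
    """Direct-mapped cache of recent virtual-to-physical page translations."""

    def __init__(self):
        # Each entry is None or (tag, page_frame_number).
        self.entries = [None] * TB_ENTRIES

    def fill(self, vaddr, pfn):
        """Install a translation, as would be done from the page tables."""
        vpn = vaddr >> PAGE_BITS
        self.entries[vpn % TB_ENTRIES] = (vpn // TB_ENTRIES, pfn)

    def translate(self, vaddr):
        """Return the physical address, or None on a translation buffer miss."""
        vpn = vaddr >> PAGE_BITS
        entry = self.entries[vpn % TB_ENTRIES]
        if entry is None or entry[0] != vpn // TB_ENTRIES:
            return None
        offset = vaddr & ((1 << PAGE_BITS) - 1)
        # Prefix the page frame number to the untranslated low-order bits.
        return (entry[1] << PAGE_BITS) | offset


if __name__ == "__main__":
    tb = TranslationBuffer()
    tb.fill(0x00012345, pfn=0x77)
    print(hex(tb.translate(0x00012345)))
```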

Although the TB is located in the M Box, it is 
maintained by microcode running in the E Box. 
This provides an economical solution to the 
complicated task of keeping track of streams of 
references from the three ports. Each port can 
have two references in progress, since accesses 
to the data cache and the tag store are overlapped 
with accesses to the TB. The data, 
addresses, and control information for these 
operations are carefully queued, with hand- 
shakes to allow the subsystems to proceed as far 
as possible (but not any further) while waiting 
for references to finish. Any memory exceptions 
encountered while prefetching instructions or 
operands are held off until the data is actually 
needed by the execution unit. That unit then 
deals with the problem, using memory refer- 
ences that bypass the normal queue, thus leav- 
ing it intact for restarting later. 

The result is a virtual memory system that is 
fast enough to allow a reference to complete 
during every cycle. With three subsystems mak- 
ing independent references, the high 
bandwidth of the bus, which allows that speed, 
can be well utilized. 

F Box Performs Floating Point 
For scientific and technical applications, the 
8600 has a floating point accelerator (FPA), the 
F Box, that operates in parallel with the E Box. 
The FPA receives operands over the oper- 
and bus from the I Box and delivers results over 
the write bus for storage in GPRs and memory 
(Figure 5). It performs floating point calcula- 
tions in all four VAX floating point formats, F, 
D, G and H (F numbers have 32 bits, D and G 
have 64, H has 128), and it also does integer 
multiplications. Usually the work involved in 
these calculations is split between the F Box 
and the E Box. The former does the arithmetic 
operations while the latter accesses memory for 
reading and writing operands, deals with 
exceptions, handles counters, and takes care of 
other chores. 

The E Box has a fairly general-purpose 
datapath, capable of dealing with the myriad
tasks involved in executing the VAX instruction
set. On the other hand, the F Box consists of 
specialized hardware (almost exclusively gate 
arrays) for doing only those steps needed 
in floating point operations. Hence, these oper- 
ations are executed in far fewer cycles. Further- 
more, the F Box cycles twice as fast as the other 
subsystems; its datapath is 32 bits, and mul- 
tiprecision operations are pipelined. The F Box 
also has its own copy of the GPRs, allowing the 
I Box to send both operands at the same time, 
one over the operand bus and one as an address 
for the GPR RAM. 

Much of the original challenge in F Box 
design lay in making it compact so as to mini- 
mize interconnect delays. Of its two modules, 
one contains the logic for floating point addi- 
tion, subtraction and division, while the other 
does floating point and integer multiplications. 
Both modules are microprogrammed, with each 
having its own microsequencer and control 
store. Moreover, the microcode is distributed 
among the various chips. This distribution 
enables a command to follow the data for sev- 
eral cycles and be repeatedly decoded as 
the floating point operation is executed. That 
allows normal operations to finish in a mini- 
mum number of cycles, while unusual condi- 
tions are detected and dealt with by the 
microcode. 

The multiplier module uses column reduc- 
tion and Booth encoding, together with a 
3-input adder, to produce a 40-bit partial prod- 
uct every half-cycle. The adder combines the 
operations of unpacking and aligning in a single 
shift, making it possible to produce an F format 
sum in only two cycles. Thus, ADDF2 takes just 
two cycles (as opposed to four in the 11/780),
MULF2 takes four cycles, and each add-multiply 
step in a POLYF polynomial evaluation takes 
only six cycles. 
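
The add-multiply step that POLYF repeats corresponds to one iteration of Horner's rule. The following is a minimal Python sketch, assuming the coefficients are listed from the highest-order term down; the function name and the cycle estimate (six cycles per step, setup overhead ignored) are illustrative and are not taken from the 8600 microcode.

```python
def horner(x, coeffs):
    """Evaluate a polynomial by Horner's rule.

    coeffs lists the coefficients from the highest-order term down,
    so [2.0, -3.0, 1.0] means 2x^2 - 3x + 1.
    Returns the value and the number of add-multiply steps performed.
    """
    result = coeffs[0]
    steps = 0
    for c in coeffs[1:]:
        result = result * x + c   # one multiply followed by one add
        steps += 1
    return result, steps


CYCLES_PER_STEP = 6  # figure quoted in the text for each POLYF step

value, steps = horner(2.0, [2.0, -3.0, 1.0])
cycles = steps * CYCLES_PER_STEP  # rough estimate, ignoring setup overhead
```

A polynomial of degree n needs n such steps, so the estimate grows linearly with the degree.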

The VAX 8600 system continues a tradition of 
providing high-speed, accurate floating point 
performance. All operations are accurate to one 
half of the least significant bit. Any floating 
point exceptions cause the instruction to back 
up to its beginning. Then control is given to an 
exception handler, which scales the operands 
before resuming computation. By having all 
four formats available, intermediate calcula- 
tions can be done in a format with greater range 
and precision, thus avoiding exceptions and 
returning a more accurate result in composite 
operations.

Besides the basic operations of add, subtract,
multiply and divide, the 8600 provides special 
instructions for argument reduction and poly- 
nomial evaluation. These instructions carry 
extra precision and also facilitate the high- 
speed software implementation of transcenden- 
tal and other sophisticated mathematical 
functions. 

System Microcode 

In addition to controlling the E Box datapath, 
the E Box microcode supervises the operation 
of the whole processor. Microcode initializes 
the system and tells the instruction buffer when 
to prefetch instructions or string data. Further- 
more, it starts and stops operand processing in 
the operand unit, maintains the address transla- 
tions in the TB, and orders the F Box to perform 
arithmetic operations. The microcode executes 
the full VAX instruction set, including recent 
additions such as G and H floating point, and 
interlocked queue instructions for multiproces- 
sing. Since it is backward compatible, the 
microcode also executes the PDP-11 instruction
set.

Considerable effort was expended on opti- 
mizing the microcode and the E Box datapath to 
execute the VAX instruction set. The result is a 
relatively narrow microword of 84 bits (includ- 
ing two for parity), which nonetheless allows 
most high-frequency instructions to complete 
in a single E Box cycle. Having immediate 
access to all 256 scratchpad locations makes it 
possible to store decimal strings and other data 
structures internally, saving crucial instruction 
cycles. Low-frequency operations are imple- 
mented principally in microcode rather than in 
hardware to save board space and reduce cost. 

The E Box microcode is written in a straightforward
language that is easy to write, understand
and debug. Of the 8K control store locations,
7K are used for the system microcode.
The remaining 1K are available to the customer
for implementing special functions, and
"hooks" are provided for fast and easy access to
user microcode.

All subsystems have microcode; however, 
compared to the E Box, they all contain more 
specialized hardware and microcode to per- 
form fewer, but more specialized tasks. Even 
so, microcode still provides an economical, 
flexible alternative to hardware as a means to 
implement control. Wherever practical, normal,
high-frequency operations are done in
hardware, whereas unusual operations are handled
in microcode.

Much of the error reporting and recovery is 
also implemented in microcode. If an error 
related to the currently executing instruction 
occurs, the microcode is trapped. It then col- 
lects the error information, fixes the error con- 
dition, backs up the affected instruction for 
later restart, and enters the machine-check 
software. 

Console 

The console, connected to all four of the boxes 
by a serial diagnostic bus, is actually an extensive
subsystem based on a PDP-11 computer.
The console monitors environmental and 
power-supply conditions, serves as the VMS 
operating system terminal, supplies a time-of- 
year clock, and provides an assortment of diag- 
nostic functions. Associated with the console 
are a local LA100 terminal for use by the operator,
an RL02 removable disk for bootstrapping
and diagnostic activities, and a remote diagnos- 
tic link. Bootstrapping is done automatically by 
the console, which serially passes microcode 
and initializing information to the various 
boxes over the diagnostic bus. The console and 
the E Box communicate via the console bus 
(C bus) to set up the I/O system and to imple- 
ment console functions such as examine, 
deposit, start, and halt. 

Input/Output System

The I/O system provides input/output over a 
synchronous backplane interconnect (SBI) 
interfaced to the M Box via the adapter bus. 
This system offers complete compatibility with 
the myriad peripheral equipment currently 
available for the VAX-11/780 Family of
machines. Moreover, the 8600 can have two 
SBIs, and its separate memory bus relieves them 
of any involvement in processor-memory trans- 
fers. Therefore, a significant increase in both 
the computational capacity and the I/O 
throughput of an existing VAX system can be 
gained simply by replacing only its processor 
with an 8600 and leaving the entire peripheral 
system in place. A single SBI can handle
13.3 megabytes of data per second, all for
input/output; two SBIs have a combined capacity
of 17.1 megabytes per second. Some I/O device
adapters connect directly to the SBI; others must
connect through a UNIBUS or MASSBUS. The
theoretical maximum capacity of the adapter
bus is 33.3 megabytes per second using two ultra
high-speed adapters with transfers in 16-byte blocks.

The latest I/O equipment is designed to be 
used with the computer interconnect (CI), 
which has a bandwidth of 70 megabits per sec- 
ond, and the Ethernet, which has a bandwidth 
of 10 megabits per second. The 8600 is the first 
VAX system to include the CI interface signals 
in its own backplane, providing as standard 
equipment the hardware necessary for its inclu- 
sion in a VAXcluster. The VAXcluster is a 
loosely coupled, multiprocessing environment 
of 16 nodes. Any node in the cluster can be 
either any member of the VAX Family, includ- 
ing another 8600, or an HSC-50 mass-storage 
controller. The HSC-50 controller provides 
intelligent, high-speed and shareable access to 
both disks and tapes for all the CPUs in the 
cluster; the maximum sustained data rate is 3.4 
megabytes per second. Each HSC-50 controller 
handles six data channels, and each channel can 
access four datapaths for either disks or tapes. 

The Ethernet can handle 1,024 stations with
a maximum separation of 2,500 meters in a 
branching, unrooted tree. It is used in local 
area networks for communications between 
computers (such as DECnet service), unit- 
record equipment, workstations and the like. 

Performance Improvements 

The improved ability of the 8600 to execute a 
specific instruction, as compared with the 
11/780, can be determined by comparing the
following factors: the shortening of the cycle 
time, the decrease in the number of cycles 
required, and the decrease in memory access 
time. Since the 8600 overlaps instructions, sim- 
ply comparing the speed of individual instruc- 
tions does not give a true indication of the abil- 
ity of the new VAX processor to perform an 
actual task. Because of the operational 
sequences chosen, even benchmarks often fail 
to give a complete picture of the improvement. 
This is true because the 8600 improves the 
speed of handling interrupt and exception 
functions even more than the speed of instruc- 
tion operations. And, of course, other quantities 
such as memory size and disk capacity also 
affect the comparative performance. 

In designing the VAX 8600 system, the basic 
performance objective was to increase the aver- 
age instruction execution speed by a factor of 
four. This objective was not only met but 
exceeded. The most significant features
contributing to this performance improvement
are the following:

■ The pipelined machine organization reduces 
by 40 percent the average number of 
machine cycles required per instruction. The 
I Box prefetches instructions and operands 
while the E Box is processing the current 
instruction. The address and data functions 
used to reference memory are also pipelined. 

■ The VAX 8600 cycle time is 40 percent of
that of the 11/780 (80 versus 200 nanoseconds)
and 60 percent of that of the
11/785 (80 versus 135 nanoseconds).

■ Faster and larger RAMs in the E Box allow the 
microcode to accomplish more processing in 
a single cycle. 

■ The cache uses a writeback strategy that 
eliminates unnecessary writes to memory. 

■ The two-way associative cache is twice the 
size of the cache in the VAX-11/780 CPU
(16KB versus 8KB). 

■ A dedicated memory bus with separate 
address and data lines eliminates contentions 
between memory references and I/O traffic, 
and between address and data transfers. 

■ Faster semiconductor technology decreases 
the gate delays for the 8600, as compared 
with the 11/780. Gate delays are 1 and 
3 nanoseconds, respectively. 
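
As a rough check on the factor-of-four objective, the first two items in the list above can be combined. The sketch below assumes the same number of instructions is executed on both machines and ignores the remaining items; the variable names are illustrative.

```python
# Fraction of the 11/780's cycles per instruction that remain after pipelining
cpi_ratio = 1.0 - 0.40

# 8600 cycle time relative to the 11/780 (80 versus 200 nanoseconds)
cycle_ratio = 80.0 / 200.0

# Relative time per instruction, and the resulting speedup
time_ratio = cpi_ratio * cycle_ratio
speedup = 1.0 / time_ratio
```

Under these assumptions the two effects alone give a speedup slightly above four, which is consistent with the stated objective.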

Pipelined Instruction Processing 
The solid boxes on the diagonal in Figure 6 
show the successive actions the processor takes 
to perform most instructions; that is, those that 
involve a single operation carried out on one 
pair of operands, represented in the instruction 
by the opcode and two operand specifiers. In 
small, low-speed computers, there is no pipe- 
lining. The processing, from fetching the 
instruction to storing the result, is performed 
for one instruction at a time. For example, the 
fetch of the next instruction does not occur 
until the result of the current instruction has 
been stored. The hardware devoted to each spe- 
cific activity is used only during that corre- 
sponding step and then remains idle until 
needed for the next instruction. 

Larger computers, like the VAX-11/780 system,
shorten their execution times by prefetching
instructions: whenever a cache cycle is
available, the instruction box continues to
prefetch more of the instruction byte stream 
from memory while activities for the previous 
instructions are proceeding. Thus, the next 
opcode is ready for decoding as soon as a result 
is stored. This simple level of pipelining 
decreases the total time required for getting the 
instructions. 

The 8600 carries the pipelining technique 
much further by pipelining the entire sequence 
of instruction activities shown in Figure 6. As 
indicated by the dashed boxes above and below 
the solid ones, the processor circuits for each 
type of activity are normally busy processing 
successive instructions. Of course, movement 
through the pipeline cannot always be at top 
speed. Various stages must sit idle whenever a 
cache miss requires waiting for data from main 
memory, or when a multiplication or division 
ties up the E Box for a whole string of cycles. 
Even the common instructions that take one 
cycle to execute still require a total of six 
cycles to complete (480 nanoseconds); a string 
of such instructions, however, can store a result 
in a register location during every cycle 
(80 nanoseconds).
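
The difference between latency and throughput described here can be put into numbers. This is a minimal sketch assuming an ideal pipeline with no stalls; the six-cycle depth and 80-nanosecond cycle come from the text, while the function names are illustrative.

```python
CYCLE_NS = 80   # cycle time given in the text
STAGES = 6      # cycles for a simple instruction to pass through the pipeline


def unpipelined_ns(n):
    """Each instruction runs from fetch to store before the next one begins."""
    return n * STAGES * CYCLE_NS


def pipelined_ns(n):
    """The first result appears after STAGES cycles, then one per cycle."""
    if n == 0:
        return 0
    return (STAGES + n - 1) * CYCLE_NS
```

For a single instruction both functions give 480 nanoseconds. For a string of 100 simple instructions the pipelined estimate is 8,400 nanoseconds against 48,000 without overlap; cache misses and multi-cycle instructions would lengthen the pipelined figure.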

As an example, consider the instruction 
ADDL2 (R0),R1, which uses two source oper- 
ands and stores the result in the location of the 
second. This involves the following steps in the
I, E and M Boxes.



1. The I Box fetches ADDL2 from the 
instruction stream in memory. 

2. The I Box uses the opcode from ADDL2 
to address the decode RAM. 

3. The I Box gets the virtual address of the 
first operand from register R0 and sends 
it to the M Box. 

4. The M Box translates the virtual address 
into a physical address, retrieves the data 
from the cache, and sends it to the I Box. 
(If the cache does not have the data, the 
procedure must wait at this stage for the 
M Box to get the data from storage.) 

5. The E Box receives operands from the
cache and R1, and adds them.

6. The E Box stores the result in R1. (If the
result were to be stored in memory, the
I Box would supply the address.)
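
The data flow of steps 3 through 6 can be mimicked in a few lines of Python. This is only a behavioral sketch: the register file and memory are plain dictionaries, address translation is assumed to be an identity mapping, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # VAX longwords are 32 bits wide


def translate(virtual_address):
    """Stand-in for translation in the M Box (identity mapping assumed)."""
    return virtual_address


def addl2_deferred(regs, memory, src_reg, dst_reg):
    """Model ADDL2 (Rsrc),Rdst: add the longword that Rsrc points to into Rdst."""
    virtual_address = regs[src_reg]                # step 3: address comes from the register
    operand = memory[translate(virtual_address)]   # step 4: translate and read the data
    total = (operand + regs[dst_reg]) & MASK32     # step 5: add, wrapping at 32 bits
    regs[dst_reg] = total                          # step 6: store the result in the register
    return total


regs = {"R0": 0x1000, "R1": 5}
memory = {0x1000: 37}
addl2_deferred(regs, memory, "R0", "R1")
```

After the call, R1 holds the sum of its old contents and the longword addressed by R0, while R0 is unchanged.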

Reduced Memory Access Time 
Those factors that contribute most to reducing 
the memory access time in the 8600 are the 
dedicated memory bus, pipelined references, 
and greater cache hit rate. 

The dedicated memory bus has decreased the
access time to the memory array by more than
two thirds: the extra time taken for a cache miss
is typically 500 nanoseconds, as opposed to
1600 for the 11/780. This happens for the following
reasons:

■ The bus itself is faster than the SBI (80 versus
200 nanoseconds).

■ There is no interference between memory 
and I/O traffic. 

■ Addresses and data are transferred simultane- 
ously rather than in sequence. 

All memory operations (addressing, data read,
and data write) are pipelined in the 8600.
Latency is still at least two cycles, one each for 
address generation and cache lookup, but a 
cache reference can be completed during every 
cycle. 

Finally, the cache hit rate of the 8600 has 
been improved simply by making its cache 
twice the size of the one used in the 11/780.
Some time has also been saved by using the
writeback strategy as compared with the write-through
strategy of the 11/780. In write-through,
both the cache and the memory array
are updated on every memory write. In writeback,
the memory array is updated only when a
modified cache entry is displaced.
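
The saving can be illustrated by counting the writes that reach the memory array under each strategy. The sketch below uses a deliberately simplified direct-mapped cache holding one word per line (the 8600 cache is two-way associative) and models stores only; the function and its parameters are hypothetical.

```python
def memory_writes(addresses, num_lines, policy):
    """Count writes reaching the memory array for a sequence of store addresses."""
    if policy == "write-through":
        return len(addresses)        # every store also updates memory
    if policy != "write-back":
        raise ValueError("unknown policy: " + policy)

    lines = {}                       # line index -> address of the dirty word held there
    writes = 0
    for addr in addresses:
        index = addr % num_lines
        if index in lines and lines[index] != addr:
            writes += 1              # a dirty word is displaced, so write it back
        lines[index] = addr
    return writes + len(lines)       # flush the words still dirty at the end


stores = [0, 0, 0, 4, 0, 1, 1]
through = memory_writes(stores, 4, "write-through")
back = memory_writes(stores, 4, "write-back")
```

Repeated stores to the same location, such as a loop counter, reach memory once under writeback but every time under write-through.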

Technology Contributions to 
Improved Performance 

The processor cycle time has been reduced 
mainly by (a) using a faster semiconductor 
technology; (b) decreasing the wire length on 
both modules and backplanes; (c) using faster 
RAMs for the registers, cache, control storage, 
and memory array. 

The semiconductor technology in the
VAX 8600 processor is emitter-coupled logic
(ECL). This logic is nonsaturating; it is, therefore,
much faster than the VAX-11/780 transistor-transistor
logic (TTL), in which state
changes require either full charge or full discharge.
The logic design takes advantage of the
very fast ECL state changes because the effects
of signal reflections were greatly reduced by
minimizing interconnect delays, and wiring
impedances were carefully controlled. ECL-TTL
conversion is needed to interface to the SBI, the
console, and the memory array (which uses 
256K TTL-compatible MOS RAMs). The conver- 
sion is handled by dual-ported RAMs that serve 
as converting buffers; data goes in in one form 
and comes out in the other. 

Instead of the flipflops employed in other
VAX systems, the VAX 8600 system mainly uses
latches in its registers and control logic.
Performance is improved because latches are
level sensitive, whereas flipflops can change 
their states only when clocked. In other words, 
no matter how quickly the inputs to a flipflop 
are set up, a new output configuration cannot 
be sent along to the next logic stage until the 
next clock. With a latch, however, the outputs 
can change when the inputs change, allowing a 
faster setup at the next stage. Despite the 
requirements for holding gating levels for some 
minimum time, this characteristic of latches is 
responsible for a reduction of about 10 
nanoseconds in the cycle time. Usually, more 
latches than flipflops are needed to implement
a given logic function; latches, however, cost 
less than flipflops, so the cost per logic func- 
tion using either type of circuit is almost equal. 
Hence, the only real cost when using latches is 
the greater difficulty in performing timing anal- 
yses. Given the significantly increased perform- 
ance, this cost is well worth it. 

Macrocell Arrays 

Until now, the semiconductor industry has used 
three approaches to meet the demand for LSI 
digital circuits: standard, off-the-shelf circuit
families; custom circuits; and gate arrays. Stan- 
dard circuits are economical but insufficient 
for the complex, specialized functions required 
by the 8600. Custom circuits, on the other 
hand, are quite expensive and take one to two 
years to design and produce. Fortunately, gate 
arrays have a shorter production time, since the 
basic array can be fabricated up to the point of 
metalization; unfortunately, the interconnect- 
ing metal makes the chip larger and increases 
the propagation delays. To circumvent these 
problems, Digital and Motorola created the so- 
called "macrocell array" approach to custom 
LSI. This approach decreases the cost and time 
to develop custom circuits and avoids many of 
the deficiencies of conventional gate arrays. 
Among the various technologies evaluated, the 
macrocell array best met the requirements of 
the 8600. 

As explained at the beginning of this article, 
the macrocell array is actually an extension of 
the gate array concept. Each cell in the array
contains a number of unconnected transistors
and resistors that can be connected to form specific
logic functions or "macros." The cell
library contains 85 macros: 54 for major cells;
14 for interface, or input, cells; and 17 for
output cells. A single array can contain 106
cells: 48 major, 32 interface, and 26 output. If
full adders and latches are used in all cells, a
single MCA may contain 1,192 equivalent gates;
if flipflops and latches are used in all cells, it
may contain 904. Typical power dissipation is
5.0 watts, or 4.4 milliwatts per equivalent gate.
Contributing to the high performance of the 
system as a whole is the extremely low propagation
delay in major and interface cells: 1.2-1.8
nanoseconds maximum, compared to 3.5-6.0
nanoseconds for 10K ECL. The high density of
100 gate equivalents per square inch, com- 
pared to 20-30 for MSI, is also important. 
Higher density reduces interconnect delays, 
thus further enhancing performance, and low- 
ers packaging costs as well. 

Reliability and Data Integrity 

Although we have not been able to eliminate 
hardware errors entirely, the VAX 8600 system 
goes a long way toward eliminating their effects 
on the user. Features are built into the 8600 at 
every level to guarantee the integrity of the data 
in the system and to promote its reliability, 
availability, and maintainability. These features 
range from minor characteristics within indi- 
vidual circuits to major provisions that embrace 
the entire system. Some of the more significant 
features are listed below. 

■ Inherent reliability is achieved through hav- 
ing a low component count, logic design for 
the worst-case situation, and high-reliability 
parts. 

■ Dynamic error reporting, by means of an 
error logger, aids in identifying the sources 
of intermittent failures. The error log is used 
for both hardware and software malfunctions 
and is kept in a disk file. 

■ Instruction retry is used whenever it is 
appropriate to the error type. For instance, 
four copies are kept of the general purpose 
registers. Therefore, on a GPR parity error, 
the instruction can be retried using a copy 
from the corresponding GPR in another box. 

■ Additional related software features include 
(a) automated patching and updating procedures;
(b) powerfail-restart support;
(c) user-mode diagnostics; (d) extensive 
protection facilities; and (e) dynamic mem- 
ory configuration to exclude bad pages. 



■ Single-bit error correction and double-bit 
error detection are used for the cache and 
the memory array, with automatic rewriting 
of the corrected word. 

■ There is parity checking at RAMs and buses, 
and parity continuity is carried through all 
major datapaths. Parity is kept not only for 
data, but also for physical addresses and the 
microcode. (Bad data in a control RAM or the 
control store is corrected by the console 
from its bootstrap files.) 

■ Address parity and a bad-data flag are 
"folded" into the error correcting code; 
thus, the storage words themselves contain 
information about error sources. 

■ There are separate selects to each memory 
array board, so the control logic for storage 
selection is all in one place, and faults can be 
isolated to an individual board. 

■ The memory battery backup has a capacity of 
ten minutes. The backup time can be set 
shorter to save on battery recharge time, thus 
allowing the alternative of riding out multi- 
ple short power failures by taking the chance 
of going down during a long one. 

■ Continuous self-testing is performed by the 
FPA when it is not in use. 

■ The system can be reconfigured without the 
FPA if floating-point failures are
experienced.

■ There are fast, accurate diagnostics with first- 
failure fault isolation to the board. (Subse- 
quent depot-level servicing can isolate to 
within ten chips, on the average.) 

■ Signals can be monitored from the console 
via the diagnostic bus. 

■ An environmental monitoring module 
(EMM) gauges the physical operating envi- 
ronment of the system. The EMM measures 
temperatures and voltages and reports out-of- 
tolerance conditions to the console, which 
can shut down the system before permanent 
damage occurs. 
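
The single-bit correction and double-bit detection mentioned in the list above can be illustrated with a small textbook code. The sketch below is an extended Hamming code protecting four data bits with four check bits. It is not the code used in the 8600, whose width and layout are not given here; the function names are illustrative.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming (SEC-DED) word."""
    d = [(nibble >> i) & 1 for i in range(4)]
    word = [0] * 8                       # word[0] is the overall parity bit
    word[3], word[5], word[6], word[7] = d
    word[1] = word[3] ^ word[5] ^ word[7]
    word[2] = word[3] ^ word[6] ^ word[7]
    word[4] = word[5] ^ word[6] ^ word[7]
    word[0] = sum(word[1:]) % 2          # makes the total number of ones even
    return word


def decode(word):
    """Return (status, nibble); status is 'ok', 'corrected' or 'double-error'."""
    word = list(word)
    syndrome = 0
    for position in range(1, 8):
        if word[position]:
            syndrome ^= position         # XOR of the positions that hold a one
    overall_ok = sum(word) % 2 == 0
    if syndrome == 0 and overall_ok:
        status = "ok"
    elif not overall_ok:
        word[syndrome] ^= 1              # single error; syndrome 0 means bit 0 itself
        status = "corrected"
    else:
        return "double-error", None      # even parity but nonzero syndrome
    nibble = word[3] | (word[5] << 1) | (word[6] << 2) | (word[7] << 3)
    return status, nibble
```

A single flipped bit changes the overall parity and yields a syndrome that names its position, so it can be repaired and rewritten. Two flipped bits leave the overall parity even while the syndrome is nonzero, so the error is reported but not corrected.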

These features make it highly likely that 
errors will be detected and corrected, thus lim- 
iting their impact. If a transient error occurs, 
the instruction execution will pause and the
machine state will be saved in memory for
processing by an error-analysis program that 
provides information to Field Service for quick 
on-site or remote repair. 

The hardware contains the various status flags 
used by the operating system to determine 
whether the instruction stream can be restarted 
following an error or some of the process con- 
text has been lost. Since most VAX instructions 
store results only upon completion, errors, in 
most cases, cause only intermediate results to 
be lost; the process can, therefore, be restarted 
at the specific instruction in which the error 
occurred. Sometimes an entire process will 
have to be stopped, although this will not affect 
the operations of other processes. In the worst
case, some errors (infrequent, but overwhelming)
may require restarting the entire system.
This strategy of graduated error catching and 
recovering, coupled with a technologically 
sound, worst-case design, creates a system with 
very high reliability and availability. 

The console is essentially a separate mainte- 
nance processor that runs the system for diag- 
nosing and isolating faults. By means of the 
serial diagnostic bus, the console can scan all 
signals needed for chip fault-isolation. (These 
signals are made available through multiplexers 
contained in the signal-terminator chips.) Also, 
the console keeps snapshot files of the long-run 
state of the machine. It has two programs to 
help system managers to avoid future difficul- 
ties. One program monitors the error log to 
warn of impending problems even if the system 
is recovering from current situations. The other 
program displays a graphic image of the system 
to highlight any faulty components; this is espe- 
cially useful in a fault-tolerant system, which 
will not crash to signal a component failure. 

Environmental Monitoring Module 
Devices for sensing various environmental con- 
ditions are located throughout the cabinet. The 
electronics and indicators associated with these 
devices are on the environmental monitoring 
module (EMM), mounted in the power-supply 
rack. In most cases, out-of-tolerance conditions 
are reported to the console for appropriate 
action. 

A principal environmental concern is over- 
heating in the logic, since the junction temper- 
ature in the MCAs directly affects their failure 
rate, which doubles with every rise of 20 
degrees Celsius. To guard against overheating,
precision thermistors monitor the ambient temperature
of the incoming air and the temperature
gradient across the card cage. By comparing
the temperature of the inlet air with that of
the air above the cage, the EMM can determine 
the temperature rise incurred by cooling the 
system logic. Should the inlet air temperature 
below the cage reach a preset value, the EMM 
will issue a warning to the console. If the inlet 
temperature reaches a danger-zone value or the 
gradient across the logic exceeds a prescribed 
amount, the EMM will issue another warning 
and, one minute later, will shut down system 
power unless the problem has been alleviated. 
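
The temperature rule and the monitoring policy described in the two paragraphs above can be sketched as follows. The threshold values are hypothetical placeholders, since the actual EMM settings are not given in the text, and the function names are illustrative.

```python
def relative_failure_rate(delta_celsius):
    """Failure-rate multiplier for a junction temperature rise, doubling every 20 degrees."""
    return 2.0 ** (delta_celsius / 20.0)


# Hypothetical limits in degrees Celsius; the real settings are not stated in the text.
WARN_INLET = 30.0
DANGER_INLET = 40.0
MAX_GRADIENT = 15.0


def emm_action(inlet, outlet):
    """Classify a pair of readings taken below (inlet) and above (outlet) the card cage."""
    gradient = outlet - inlet
    if inlet >= DANGER_INLET or gradient > MAX_GRADIENT:
        return "shutdown-pending"    # warn, then power off a minute later unless cleared
    if inlet >= WARN_INLET:
        return "warn"
    return "normal"
```

With these placeholder limits, a 40-degree rise in junction temperature corresponds to a fourfold failure rate, which is why the shutdown path is triggered by either a hot inlet or an excessive gradient across the logic.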

Another important function of the EMM is 
measuring the output voltages of the power 
supply. Power-supply voltages must be the cor- 
rect values to ensure reliable system operation. 
If any of these is out of its operating range, the 
EMM will report the violation to the console. 
Voltages are measured continuously so that any 
out-of-tolerance conditions will be known and 
can then be reported to Field Service. 

Other environmental features include 
devices for detecting an overheated regulator, a 
failed blower, and inadequate air flow. Regula- 
tor overheating, whether due to faulty opera- 
tion or excessive ambient temperature, causes 
the closing of a thermal switch that shuts down 
the main power control. Unless accompanied 
by a temperature problem, other, less drastic 
failures are reported so that the system manager 
can resolve them. 

Besides its monitoring functions, the EMM 
controls power sequencing, both on and off. 
The computer has an electronic keying system 
that detects a board plugged into the wrong 
slot, and the EMM will not allow logic voltages 
to go on unless all modules are installed 
correctly. 

Packaging Innovations 

We had to make significant changes in the cur- 
rent levels of package design, from the semi- 
conductor devices to the cabinets, to capitalize 
fully on the new circuit technology. Therefore, 
we incorporated new techniques in intercon- 
nect, packaging and cooling in order to com- 
plement the semiconductor technology and to 
meet new environmental and safety regulations. 
These efforts were undertaken by Digital's own 
technology development team with, in many 
cases, the cooperation of other internal groups 
and external vendors.

Our efforts to meet the stringent density and
electrical requirements at the device level led 
to the development of LSI packages that serve 
not only our needs, but also those of others in 
the industry. By employing extensive computer 
modeling of the system's thermal characteris- 
tics, we designed an integral heat sink that 
mounts directly on each MCA chip. At many 
critical locations, ICs are installed in high-relia- 
bility sockets that facilitate field repair. This 
decreases the system's downtime, a fact which 
helps to minimize the life-cycle cost of the sys- 
tem without jeopardizing its inherent 
reliability. 

Up to six layers of wiring are required to 
interconnect the devices mounted on a printed 
circuit board. This wiring is maintained at a 
controlled (transmission line) impedance to 
guarantee signal integrity. To ensure uniform 
cooling of the components, we used wind-tun- 
nel techniques to develop device placement 
algorithms, and computer analyses of each 
module design to provide thermal profiles of 
the integrated circuits. By implementing 
unique power connectors, rather than using 
many edge-connector pins in parallel, we 
gained sufficient signal pins for the density of 
components on the modules. In addition, the 
multivoltage bus bar that distributes power on 
the board also acts as a stiffener to maintain 
flatness. 

Both the modules and the backplanes they 
plug into are supported and located by a preci- 
sion, one-piece card cage that also acts as a 
plenum for the cooling air flow. The back- 
planes contain 16 layers of printed wiring in a 
laminated structure. To improve backplane 
reliability and ease of repair, all connectors are 
the solderless press-pin type; they utilize com- 
pliant pins to ensure long-term electrical con- 
tact to the circuit board. Power distribution is 
handled by large, copper bus bars for the pre- 
dominant voltages and by the cast backplane 
frame for ground return. Again, solderless 
press-pin technology is used to assemble power 
and ground connectors to the distribution system.
Power-supply regulators are located above
the logic assembly to facilitate power distribution
and to allow a straight, single-path air flow.
Along with acoustic treatment, this provides a 
simple, reliable cooling system that satisfies the 
latest regulations, including the noise limit rec- 
ommended for a computer-room. 



Special care was taken to design the system's 
cabling to ensure that, in most cases, cables are 
not disturbed when any logic or power mod- 
ules are removed. Furthermore, all external 
cables interface to an external bulkhead, both 
to facilitate rapid installation and to meet elec- 
tromagnetic radiation regulations. Cabinets 
were redesigned to improve site assembly and 
to help contain electromagnetic emissions. At 
the same time, backward compatibility with 
other VAX systems has been accomplished, so 
that previously purchased expansion cabinets 
can still be attached to the processor. Overall, 
an 8600 with over 16 megabytes of memory is 
actually smaller than a comparable 11/780, 
although the new machine does operate with 
one kilowatt more power.

The VAX 8600 I Box,

A Pipelined Implementation 

of the VAX Architecture 

The VAX 8600 CPU achieves four times the performance of the VAX-11/780
CPU by using high-speed ECL technology and an internal organization
with a four-stage pipeline. In this pipeline, up to four
instructions can be in various stages of execution at any time. At its heart
is the instruction and operand fetch unit, the I Box. Under favorable
conditions, the I Box can deliver one instruction every 80 nanoseconds
to the instruction execution unit, the E Box, yielding a peak execution
rate of 12.5 MIPS. Special attention is given to the internal organization
of this I Box as it differs from those in previous VAX implementations.



The VAX 8600 computer system is the first 
pipelined implementation of the VAX architecture [1].
Like its nonpipelined predecessors, the
VAX 8600 CPU implements the full VAX
instruction set and runs under the VMS and
ULTRIX operating systems. In addition, the
VAX 8600 CPU provides higher performance
and reliability than its predecessor, the
VAX-11/780 CPU.

In this context, the performance improve- 
ment factor needs to be clearly defined to avoid 
the confusion that usually arises when discuss- 
ing performance. First, let us define a given 
program's improvement factor as the time it 
takes to execute that program on the 
VAX-11/780 CPU divided by the time to execute
it on the VAX 8600 CPU. The VAX 8600
CPU's "true" measure of performance improve- 
ment is then the average of such improvement 
factors over all programs. Since the universe of 
all programs is too large, one has to select a 
proper subset of favorite benchmarks for the 
comparison. This subset of benchmarks can be 
labelled as the constant unit of work (CUW), 
and its selection is often the reason for conflict- 
ing reports in articles on computer perform- 
ance. The execution time of this CUW in our 
model is the product of three quantities: the 
number of instructions, the average number of 
cycles per instruction, and the cycle time of the 
machine under evaluation. 



The performance goal of the VAX 8600 project
team was to reduce the average number of
cycles per instruction from 10 (in the
VAX-11/780 CPU) to 6, and also to reduce the
cycle time of the machine from 200
nanoseconds (in the VAX-11/780 CPU) to 80
nanoseconds. To achieve the shorter cycle
time, custom ECL gate arrays and standard
10K ECL logic were used throughout the
design. This technology improved the
performance by 2½ times. The remaining
performance gain of 1⅔ times was achieved
by reducing the average
number of cycles per instruction through the
use of a four-stage pipeline. This pipeline is
capable of overlapping the fetching of instruction
stream data with the decoding of instructions,
the prefetching of operands from memory,
and the execution of instructions. In the
VAX-11/780 CPU, on the other hand, the stages
for the operand address calculation, operand
fetch, and operand write are all merged into the
execution stage. In the VAX 8600 CPU, up to
four instructions can simultaneously be in
different stages of execution at any one time.
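
The arithmetic behind these goals can be sketched in a few lines of
Python. This is only an illustration of the three-factor model given
above; the function name and the instruction count chosen for the
CUW are hypothetical, while the cycle counts and cycle times are the
project goals quoted in the text.

```python
# Execution-time model from the text:
#   time = instructions x average cycles per instruction x cycle time
def execution_time_ns(instructions, cycles_per_instruction, cycle_time_ns):
    return instructions * cycles_per_instruction * cycle_time_ns

# Hypothetical size of the constant unit of work (CUW); any value works,
# because it cancels out of the improvement factor.
CUW_INSTRUCTIONS = 1_000_000

t_780 = execution_time_ns(CUW_INSTRUCTIONS, 10, 200)   # VAX-11/780 figures
t_8600 = execution_time_ns(CUW_INSTRUCTIONS, 6, 80)    # VAX 8600 goals

cycle_time_gain = 200 / 80        # gain attributed to ECL technology
cpi_gain = 10 / 6                 # gain attributed to the pipeline
improvement = t_780 / t_8600      # overall improvement factor

# Peak rate if one instruction is delivered every 80 ns cycle.
peak_mips = 1000 / 80

print(f"cycle-time gain: {cycle_time_gain:.2f}")
print(f"CPI gain:        {cpi_gain:.2f}")
print(f"overall factor:  {improvement:.2f}")
print(f"peak MIPS:       {peak_mips:.1f}")
```

The product of the two gains is slightly above four, which is
consistent with the "four times" figure quoted for the machine.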

The remainder of this paper is organized as 
follows. First, a limited description of the VAX 
instruction set is presented. Then, an overall 
description of the VAX 8600 CPU internal 
organization is provided to familiarize the 
reader with the general environment of the 
topic. Here definitions are given of the concepts,
mechanisms, and building blocks that
will be referenced in the examples of the pipe- 
line model. Further on, an abstract model of 
pipelines is introduced and a description of the 
VAX 8600 CPU in terms of such a model is 
presented. Finally, the details of the internal 
organization of the instruction unit (I Box) and 
its associated control structure are presented, 
including an example of a section of code flow- 
ing through the pipeline. 

The VAX Instruction Set 

The VAX architecture has a rather rich and 
powerful instruction set. Each instruction, in 
general, consists of one byte of opcode, option- 
ally followed by one to six operand specifiers. 
These specifiers can represent the accessing 
scheme for an operand, the displacement in a 
branch instruction, or the target address in a 
call type of instruction. The data type and usage 
of each specifier is derived from the opcode. 
There are also two-byte opcodes for multiprecision
floating point operations, instruction set
extension, and user-defined operations. The
instruction set is standardized so that each VAX
implementation is able to execute the same
software image as well as the same operating
system environment. This compatibility is the
basic goal for all VAX implementations, including
the 8600.

The VAX 8600 Environment 

Functionally, the CPU (Figure 1) consists of 
four separate microcoded units for memory and 
I/O (M Box), for instruction fetches and preparations
(I Box), and for instruction execution
(E Box and F Box). The F Box is a coprocessor 
for high-speed floating point execution. These 
subsystems and their interconnecting buses are 
now described. 

M Box - The Center of System Communication

The primary purpose of the M Box is to link the 
main memory, the cache, the CPU ports, and 
the I/O subsystem. In this capacity, the M Box 
is the communication center at the system 
level. 

The M Box contains a physical cache for 
instructions and data and a virtual address trans- 
lation buffer (TB). It also has exclusive access 
to the memory array. These resources are 
accessed by three fixed-priority CPU ports and 
an I/O port, as shown in Figure 2. The M Box, as 
the system communication center, must con- 
tend with several concurrent activities requir- 
ing communication services. To cope with 
these numerous requirements, the M Box is
heavily microcoded and occasionally calls 
upon E Box microcode to assist with some 
memory management functions. The M Box has 
the capability of queuing a number of memory 
requests from both the instruction fetch and 
execution units. Both the I Box and E Box can 
request M Box service through their own mem- 
ory ports and buses. 

A more detailed description of the M Box can 
be found in reference 2. 

I Box - The Heart of the Pipeline
The primary purpose of the I Box is to continu- 
ously feed microcode dispatch addresses and 
operands to the E Box and F Box so that they 
may execute the VAX instruction set. To do 
that, the I Box must prefetch the instruction 
stream from the M Box and then interpret it: 
parse the specifiers, fetch the operands and 
build the dispatch address (Efork) for the 
E Box. Three of the four pipeline stages, includ- 
ing a microcoded operand address calculation 
engine, are used to implement these functions 
at high speed. Extensive control logic is needed 
to synchronize the flow of data and control 
through the pipeline. Furthermore, the I Box 
contains the logic to maintain the many pro- 
gram counters representing the different 
instructions executing concurrently in the 
pipeline. 



The virtual ownership of the pipeline, 
including the critical E Box dispatch interface, 
the control of most of the CPU-to-M Box inter- 
face, and the maintenance of the program coun- 
ters, makes the I Box the heart of the pipeline 
and the object of much of the complexity of the 
VAX 8600 CPU. 

E Box and F Box - The Essence of the
VAX Architecture 

In general, the E Box and F Box consume the 
dispatch addresses and operands set up by the 
I Box and perform only the operations as speci- 
fied in the opcode of a macroinstruction. In this 
way, these boxes are isolated from memory 
access and freed from specifier evaluation and 
operand fetching. They can thus be optimized 
for high-speed execution. The E Box also per- 
forms the secondary function of managing the 
boundary conditions for both the hardware 
(machine checks, such as single- and double-bit 
memory errors and parity errors) and the VAX 
architecture (interrupts and exceptions). In 
particular, most memory management bound- 
ary conditions are handled by the E Box. TB 
misses, page faults and access violations, page 
crossings and unaligned E Box memory refer- 
ences are detected by the M Box but are all 
serviced by the E Box. In this respect, the exe- 
cution units are the essence of the VAX 
architecture. 
System Buses

There are a number of internal buses that are 
key to the organization of the VAX 8600 CPU 
and to understanding it. These include the 
following: 

1. IVA bus - I Box virtual address bus, which
carries virtual addresses from the I Box
to the M Box during instruction fetch,
operand fetch, and I Box-write
operations

2. MD bus - Memory data bus, which carries
data for both reads and writes to the
M Box subsystem

3. OP bus - Operand bus, which carries
operands from the I Box to the E Box and
F Box

4. W bus - Write bus, which carries results
from the execution units to memory (via
the I Box) or to the general purpose
registers (GPRs)

5. EVA bus - E Box virtual address bus,
which carries virtual addresses from the
E Box to the M Box during E Box operand
references and certain memory management
routines

6. A bus - I/O bus, which interfaces the CPU
to the I/O subsystems

So far we have briefly introduced the fundamental
building blocks of the VAX 8600 CPU.
We will now analyze it from the more abstract
level of its microarchitecture, that is, its
pipeline structure. To this end, a model of pipelines
is first developed.

The Pipeline Model

Pipelined computers are not new. From the
early days of the IBM Stretch [3] and the IBM
360/91 [4] to the scalar units of the CDC [5] and
CRAY [6] machines, pipelining has been a proven,
if expensive, method for performance enhancement.
Such enhancement is achieved by replacing
the sequential execution of each instruction
step in a single functional unit with the
concurrent execution of some or all steps in
multiple functional units.

In most Von Neumann processors, the
instruction fetch and decode functions are performed
sequentially in the only "stage," the
execution unit, which is also the entire CPU. A
typical example is the PDP-11 system, in which
the concurrency is microprogrammed. (See
Figure 3a.)

Most existing VAX implementations have
added a stage for instruction prefetch, thus
reducing the instruction fetch latency; the
prime example is the VAX-11/780 CPU. (See
Figure 3b.)

The VAX 8600 CPU is the first implementation
of the VAX architecture that separates
instruction preparation (for example, effective
address calculations and operand fetches) from
instruction execution itself. (See Figure 3c.)

The significance of the VAX 8600 design lies
in the successful resolution of the implementation
difficulties that stem from the combined
complexities of the VAX architecture and the
pipeline approach: the more complex an architecture
(that is, the greater the control and data
dependencies), the more difficult it is to pipeline
it.

While the basis and fundamentals for such 
designs can be found in references 7 and 8, and 
a more recent pipeline model is discussed in 
reference 9, we present here a simplified 
model for the purpose and scope of this paper. 
Then, using such a model, we describe the 
VAX 8600 pipeline. 

An Ideal Pipeline Model 
In this section we define a simple model of a 
pipeline. Examples from the section on the 
Simplified VAX 8600 Pipeline Model, 
described later in this paper, are used to illus- 
trate the abstract concepts presented in this 
section. 

Let us define a pipeline stage, depicted in 
Figure 4a, as an entity with four fundamental 
attributes: function, hardware residency, prece- 
dence, and the number of stage elements. 

The function of a stage is usually an input 
buffer, an output buffer, and a mapping 
between the two. For example, the function of 
the operand access unit (OAU) stage is to com- 
pute an operand effective address, fetch it from 
the M Box, and then load it into the output 
buffer, the I Box memory data (IMD) register. 

The hardware residency of a stage is where it 
resides in the hardware. For example, the OAU 
stage resides in the I Box hardware. 

The precedence of a stage is its position in 
the sequence of stages. This precedence is fixed 
and means that the instruction decode stage, for 
example, is a successor of the prefetcher stage. 
Note that the precedence relation is a logical
concept and not a physical one. For example,
although the memory write function of the exe- 
cution stage is part of the last stage of the pipe- 
line, it shares resources with the OAU stage. 

Finally, a stage function is implemented by 
one or more elements. Under optimal condi- 
tions, an element processes an item in one 
physical cycle. However, more than one physi- 
cal cycle may be needed when the function that 
the element implements is a complex one, or 
when the element has to wait for certain 
resources. 

Let us now define a few concepts that are key 
to the understanding of the pipeline model. 

The logical cycle of a stage is the number of 
physical cycles needed to process an item. 
Under optimal conditions, this number is usu- 
ally the same as the number of elements in the 
stage. The reason for this distinction between 
logical and physical cycles will become appar- 
ent with the following examples. 

1. In the first example, the OAU stage 
processes a simple specifier, such as reg- 
ister deferred mode (Rn). In this case, 
one logical cycle equals two physical 
cycles: one to compute the operand 
address and another to fetch the operand 
itself. 

2. As a second example, consider again the 
OAU stage's processing of a complex 
specifier, such as longword displacement
deferred indexed, @LD(Rn)[Rx],
with a cache miss in the indirect refer- 
ence. In such a case, one logical cycle 
will equal N physical cycles, where N is 
directly dependent on the state of vari- 
ous system resources. 
A pipeline is a sequence of stages connected
by "transport" mechanisms, which move an 
item from the output buffer of one stage to the 
input buffer of the next. Except for the first and 
last stages, such a structure can be partitioned 
into a current stage, all its precedent stages, and 
all its subsequent stages. One can also define 
the predecessor stage as the immediately pre- 
ceding stage, and the successor as the one 
immediately following. 

What has been described so far is the 
datapath of a pipeline. 

Control of the Ideal Pipeline 
The datapath of the pipeline model just dis- 
cussed is a somewhat simple concept that 
belies the complexity of the mechanisms 
needed to control it. In the ideal case shown in 
Figure 4a, the relatively simple synchronization 
is based on "local control" and is implemented 
by the stall conditions defined below. 

Local control is defined as the control of a 
flow of items through the pipeline by flags that 
are transported together with the items. These
are the valid flags of the input and output buffers.
The two basic operations of loading and
draining can give such flags the values of either
"empty" or "full." These two values are also
called "invalid" and "valid," respectively.

Loading occurs at the completion of a logical 
cycle, when a stage writes an item into its out- 
put buffer and sets the buffer's valid flag to full. 

Draining occurs at the beginning of a logical 
cycle, when a stage reads an item from its input 
buffer and sets the buffer's valid flag to empty. 

Depending on the operation and on the val- 
ues of these flags, one of two stall conditions 
can occur. 

1. An input stall takes place when the valid
flag of the input buffer is empty and the 
stage wants to drain it. Then the stage 
must avoid loading the output buffer, 
since it would be loaded with invalid 
data. 

2. An output stall takes place when the 
valid flag of the output buffer is full and 
the stage wants to load it. The stage must 
then stop to avoid data overrun. 
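
The loading, draining, and stall rules above can be illustrated with
a minimal Python sketch. It is not taken from the VAX 8600 design:
the names Buffer and step are hypothetical, and for simplicity the
sketch checks both stall conditions before it drains, whereas the
text places draining at the beginning and loading at the end of a
logical cycle.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Buffer:
    """A stage buffer with its valid flag (True = "full", False = "empty")."""
    item: Any = None
    full: bool = False


def step(function: Callable[[Any], Any], inp: Buffer, out: Buffer) -> str:
    """Attempt one logical cycle of a stage under local control."""
    if not inp.full:
        # Input stall: nothing valid to drain, so the output buffer
        # must not be loaded with invalid data.
        return "input stall"
    if out.full:
        # Output stall: loading now would overrun the previous item.
        return "output stall"
    item = inp.item
    inp.item, inp.full = None, False          # draining sets the flag to empty
    out.item, out.full = function(item), True  # loading sets the flag to full
    return "advanced"


inp, out = Buffer("movl", True), Buffer()
print(step(str.upper, inp, out))   # the stage processes the item
print(step(str.upper, inp, out))   # input buffer is now empty
inp = Buffer("addl", True)
print(step(str.upper, inp, out))   # output buffer is still full
```

Because each stage looks only at the flags of its own two buffers,
no central controller is needed in the ideal case.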

Even in the case of an ideal pipeline, an 
important performance issue is that of elasticity 
of the pipeline. Elasticity is the ability of the 
pipeline to deliver results at full bandwidth in
spite of its irregularity. Irregularity results
when different stages in the pipeline have logi- 
cal cycles of different duration; hence the time 
to process an item in each stage is variable. 

Rigidity, the reciprocal of elasticity, mea- 
sures the dependence of a stage on the stalled 
state of another stage. In other words, the rigidity
is related to the speed with which the stall
flags "ripple" through the stages, in either
direction. Rigidity is counterproductive in that
it stifles concurrency. For that reason, extra 
buffering is sometimes used; this allows a stage 
to execute even if some output buffers are 
already full, thus reducing output stalls. This 
also means that the input buffers to the succes- 
sor stage will be able to be "preloaded," thus 
reducing input stalls as well. 

However, simple FIFO extra buffering may 
introduce the negative effect of increasing the 
pipeline latency (that is, the number of physi- 
cal cycles needed by an item to travel through 
the entire pipeline). This effect can be mini- 
mized by the use of "bypass" circuitry, as 
described in reference 9, at the cost of a signifi- 
cant amount of control complexity. To mini- 
mize such complexity, one can reduce the 
number of input and output buffers in a stage to 
just one output buffer. In this case the single- 
stage buffer functions both as the output buffer 
of that stage and as the input buffer of the suc- 
cessor stage. The VAX 8600 design is very close 
to this model. 

A Model with Pipeline Dependencies 
All pipeline models have embedded, via the 
precedence attribute, the "trivial" dependency 
of a stage; that is, its dependence on the output 
buffer of its predecessor stage. However, a 
more realistic pipeline model (see Figure 4b) 
must include nontrivial dependencies as well, 
that is, dependencies of a stage on other than 
the output buffer of its predecessor stage. Such 
dependencies can be classified according to 
their type (data or control) and direction (for- 
ward or backward). 

A stage has a data dependency if it needs data 
values produced by a stage other than the pred- 
ecessor stage. For example, the OAU stage must 
wait until the E Box has updated a GPR before it 
can use that GPR in the address calculation, as 
shown in Figure 4c. 

A stage has a control dependency if it needs 
control produced by a stage other than the 
predecessor stage. For example, the OAU stage, 
which also processes branches, must wait until
the E Box has generated the condition codes for 
the instruction preceding the branch. Once the 
condition codes are set, the OAU can resolve 
the branch, as shown in Figure 4c. 

Each of these dependencies can operate in a 
forward or backward direction. In a backward 
dependency, a piece of a data or control item 
affects a precedent stage. Either example for 
the control or data dependency illustrates the 
point. In a forward dependency, a piece of a
data or control item affects a subsequent stage.
An example is the I Box-write address depen- 
dency, which is described in the next section. 

In addition to the above, there are resource 
dependencies, which occur when a stage needs 
to use a resource shared among many stages. 
The memory in the M Box, for example, is a 
resource shared by three of the VAX 8600 
stages. 

All of these dependencies make the imple- 
mentation of a pipeline more difficult than in 
the ideal case. However, they sometimes allow
a more efficient global control of the pipeline. 
This is the control of the flow of items through 
certain stages by key flags that are broadcast by 
another stage. Note that this mechanism oper- 
ates in conjunction with the local control. 

In the next section the concepts just intro- 
duced will be used to represent the VAX 8600 
CPU in terms of the simple abstract pipeline 
model just described. 

The Simplified VAX 8600 Pipeline 
Model 

A simplified model of the datapath portion of 
the VAX 8600 pipeline is shown in Figure 4c. 
In this model the F Box is not shown, as its 
locus of control is very similar to that of the 
E Box. This four-stage design has two funda- 
mental resource dependencies, which embody 
much of the logic to control the pipeline: the 
M Box, which is used by the instruction 
prefetch (prefetcher) stage and sometimes by 
the OAU and execution stages; and the GPRs, 
which are used normally by the OAU and exe- 
cution stages. 

Before discussing the simplified model, let us 
follow an instruction as it goes through the 
pipeline. 

At the beginning of instruction processing, 
assume that all the I Box buffers are invalid. In 
this case the E Box dispatches the instruction 
prefetcher at the new instruction stream 
address. The prefetcher stage starts prefetching 
and loading instructions into the instruction 
buffer (Ibuffer). This is actually a simplifica- 
tion; the detailed mechanism is described in 
the Instruction Prefetch section. The instruc- 
tion decode stage, called the decode stage, 
drains the Ibuffer and from the opcode gener- 
ates a dispatch address (not shown in 
Figure 4c) for the E Box microcode. The oper- 
and address calculation unit (ACU) element in 
the OAU stage parses the operand specifiers and 
computes their effective addresses, in the pro- 
cess reading and possibly modifying the GPRs 
(e.g., autoincrement mode, (Rn)+). The oper- 
and fetch (Opfetch) element fetches these 
operands at that effective address and passes 
them to the E Box. The E Box then executes the 
instruction it was dispatched to; in doing so, if 
the destination is a register, it drains the oper- 
ands and writes the result into the GPRs. If the 
destination is memory (and only in that case),
the memory write (Mem-write) element is
used. It takes the result data from the E Box and
writes it to memory via the operand port (see
Figure 2) at the address forwarded by the ACU
element. Such a mechanism is called an
I Box-write.

Let us now look at each stage of the pipeline 
of Figure 4c in more detail. 

The prefetcher stage is composed of the 
instruction address calculation (Iaddr) element 
and the instruction fetch (Ifetch) element. The 
Iaddr element computes the next value of the 
virtual instruction buffer address (VIBA) register
and issues an Ibuffer request. The Ifetch
element fetches a longword from the address 
pointed to by the VIBA register and loads it into 
the Ibuffer. The prefetcher stage resides in the 
I Box. Its logical cycle lasts two physical cycles 
in the case of a cache hit, or N physical cycles 
otherwise, where N depends on the memory 
access delay. 

The decode stage is composed of only one 
element and its logical cycle always lasts one 
physical cycle. It decodes opcodes and specifi- 
ers from the Ibuffer and loads control data into 
the Ifork buffer (the Ifork will be defined in the 
Instruction Decode section) and instruction 
stream data into the data multiplexer (Dmux) 
buffer. The Ifork and Dmux buffers together 
form the output buffer of the decode stage. The 
decode stage resides entirely in the I Box. 

The OAU stage is composed of the ACU and 
Opfetch elements. The ACU element computes 
an operand effective address, loads it into the 
virtual address (VA) register, and issues an 
operand request. The Opfetch element fetches 
the operand from the M Box and loads it into 
the I Box memory data (IMD) register. The OAU 
stage also forwards the VA to the Mem-write 
element. Note that this stage can contain two 
instructions at any given time. The OAU stage 
resides in the I Box, and its logical cycle lasts a 
minimum of two physical cycles. 

The execution stage is composed of the E Box 
and the Mem-write elements. The E Box element
executes instructions and stores results
into either the GPRs or the write latch for mem- 
ory writes. In the latter case it initiates an I Box- 
write command. The Mem-write element actu- 
ally performs the write operation at the address 
forwarded by the VA register in the OAU stage. 
The execution stage resides in the E Box, F Box, 
and partially in the I Box for memory writes. Its 
logical cycle lasts a minimum of one physical
cycle; for example, in the case of register desti- 
nation instructions, such as MOVL (Rx),Ry. It 
will last at least three physical cycles in the 
case of memory destination instructions, such 
as MOVL Rx,(Ry), or even longer in the case of 
complex instructions. 

In the simplified model each stage has only 
one output buffer, which functions also as the 
input buffer of the successor stage. Thus a drain 
operation is implemented as an interstage drain 
signal. Note that in this case the elasticity of the 
pipeline is reduced to a minimum. In the worst 
case, if the pipeline is full and the last stage 
stalls, then all the stages in the pipeline will 
stall. 

Moreover, since a stall condition must be 
detected before loading the output buffer, the 
output stall condition is more stringent in cer- 
tain cases, as defined below, than the one intro- 
duced earlier. The output stall is here defined 
as a condition in which the valid flag of the 
output buffer is full and the stage wants to load 
it, AND the successor stage is not draining it. 
This means that the stage will stall less fre- 
quently. However, note that the input stall con- 
dition remains the same as defined earlier. 
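
A small Python sketch may help to show how this refined output stall
behaves in a chain of single buffers. It is an illustration under
stated assumptions, not a description of the actual control logic:
the function name cycle is hypothetical, every logical cycle is
assumed to take one physical cycle, and the first stage is assumed
always to have input available.

```python
def cycle(full, last_stage_stalled):
    """Evaluate one physical cycle of an n-stage pipeline.

    full[i] is the valid flag of the single buffer between stage i and
    stage i + 1 (the output buffer of one, the input buffer of the other).
    Returns (moves, new_full), where moves[i] is True if stage i
    processed an item in this cycle.
    """
    n = len(full) + 1
    moves = [False] * n
    # The last stage drains its input unless it is stalled internally.
    moves[n - 1] = full[n - 2] and not last_stage_stalled
    # Work backward so each stage can see its successor's drain signal.
    for i in range(n - 2, -1, -1):
        has_input = True if i == 0 else full[i - 1]
        # Refined output stall: the buffer is full AND the successor
        # is not draining it in this cycle.
        can_load = (not full[i]) or moves[i + 1]
        moves[i] = has_input and can_load
    new_full = [moves[i] or (full[i] and not moves[i + 1])
                for i in range(n - 1)]
    return moves, new_full


# Full four-stage pipeline, last stage running: every stage advances.
print(cycle([True, True, True], last_stage_stalled=False))
# Full pipeline, last stage stalled: the stall reaches every stage.
print(cycle([True, True, True], last_stage_stalled=True))
```

The second call reproduces the worst case described above: with a
full pipeline and no extra buffering, a stall in the last stage stops
all the stages in the same cycle.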

In such a model there are some interesting 
examples of nontrivial dependencies. 

■ The prefetcher stage has two backward con- 
trol dependencies, the decode and OAU 
stages, that affect the Ibuffer requests to the
M Box. The issuance of such requests by the 
prefetcher stage requires the knowledge of 
the validity of the decode stage's output 
buffer and also whether or not the OAU stage 
is draining it. These two dependencies are 
fundamental because they take the place of 
the prefetcher stage's trivial dependency on 
its predecessor stage, which does not exist. 

■ The OAU stage has a backward data depen- 
dency, the execution stage, that affects its 
ability to resolve branches. The OAU stage 
must wait for the condition codes from the 
E Box, after completion of the instruction 
preceding a branch, in order to resolve it and 
start prefetching at the target address. 

■ The execution stage has a forward data 
dependency, the OAU stage, when they 
together execute an I Box-write command. 
In this case the OAU stage forwards the destination
address to the Mem-write element (as
far as hardware is concerned, the address
stays in the VA register). When the ACU takes
many cycles to compute the effective
address, the E Box may have to wait before
disposing of the data.

The VAX 8600 I Box 

The three pipeline stages residing in the I Box 
are physically composed of the following 
structures: 

■ An instruction prefetch stage (prefetcher in 
Figure 4c), which prefetches the instruction 
stream for the Ibuffer. (This stage is also
used to fetch string data in string 
instructions.) 

■ Decoding logic, which produces dispatch 
addresses, based on opcode and its specifi- 
ers, for the operand address calculation unit 
micromachine and the E Box. (This is the 
decode stage as defined in the pipeline 
model.) 

■ A micromachine, called the ACU 
micromachine, which implements the func- 
tionality of the OAU stage and part of the 
Mem-write element. (This functionality
includes operand address calculations, operand
fetches, and result writes.)

Notice that part of the Mem-write element 
resides in the I Box. This part maintains the 
memory write address for results operands and 
shares responsibility with the E Box element to 
perform the actual results write. 

Furthermore, the I Box maintains the follow- 
ing items: 

■ Program counters for tracking different 
instructions being executed at different 
stages in the pipeline 

■ A local copy of the GPRs for operand effec- 
tive address calculations and operand 
sourcing 

■ A register scoreboard for resolving register 
access conflicts 

■ A register log (Rlog) for register state restora- 
tion during exceptions and interrupts 

■ A branch decision mechanism 

■ Control mechanisms to synchronize the 
pipeline 
The importance of the VAX 8600 I Box lies in
the many functions it has to perform and the
extensive controls required to correctly synchronize
all four stages of the pipeline.
Figure 5 depicts the datapath of the I Box. The 
following sections describe the functions of 
many of its features. 

Instruction Prefetch 

The prefetcher has an eight-byte Ibuffer and 
associated addressing and control logic. It 
attempts to initiate a prefetch whenever an 
"empty" byte is detected inside the Ibuffer. 
The VIBA register contains the next address in
the instruction stream to be fetched from.
Prefetch request addresses share the IVA bus 
with the ACU via the Ibuffer port and operand 
port respectively. (See Figure 2.) Since an oper- 
and fetch is a result of executing an already 
decoded instruction, it has a higher priority in 
using the IVA bus. Prefetches, on the other 
hand, can be postponed and thus have lower 
priority. 
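
The priority rule for the IVA bus can be written down as a short
Python sketch. The function names are hypothetical, and only the two
requesters named in this section are modeled; the eight-byte Ibuffer
size and the relative priority are taken from the text.

```python
IBUFFER_BYTES = 8  # size of the Ibuffer given in the text


def prefetch_wanted(valid_bytes: int) -> bool:
    """A prefetch is attempted whenever the Ibuffer has an empty byte."""
    return valid_bytes < IBUFFER_BYTES


def grant_iva_bus(operand_request: bool, valid_bytes: int):
    """Decide which port drives the IVA bus in this cycle.

    Operand fetches belong to an already decoded instruction, so they
    win; prefetches can be postponed and take the bus only when it is
    otherwise idle.
    """
    if operand_request:
        return "operand port"
    if prefetch_wanted(valid_bytes):
        return "Ibuffer port"
    return None


print(grant_iva_bus(operand_request=True, valid_bytes=3))
print(grant_iva_bus(operand_request=False, valid_bytes=3))
print(grant_iva_bus(operand_request=False, valid_bytes=8))
```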

The memory subsystem queue can accept a 
second prefetch even if a previous prefetch is 
still in progress. This mechanism results in bet- 
ter utilization of the available memory 
bandwidth. Data received through the MD bus
is loaded into the appropriate location inside
the Ibuffer. The VIBA register is updated to 
form the next address whenever a prefetch 
request is accepted by the M Box. 

During a cold start, after an exception, or for 
certain branches (such as the CASE instruc- 
tion), the prefetch sequence must start from a 
new instruction address. In this instance the 
E Box places the new address on the W bus and 
dispatches the ACU micromachine to an I Box 
startup sequence. Instead of loading the address 
to the VIBA register and starting the prefetching 
process, the ACU micromachine initiates two 
consecutive requests before handing the con- 
trol of the prefetching process over to the 
prefetcher stage.

For some instructions requiring stream data 
(e.g., character string instructions) or a stream 
of operands (e.g., the popping of the GPRs 
from the stack in the RET instruction) to be 
read consecutively from memory in their exe- 
cution, the Ibuffer becomes a high-speed data 
buffer supplying operands to the E Box through 
the OP bus. 

Instruction Decode 

Instruction decoding in the VAX 8600 CPU is 
similar to that in the VAX-11/780 CPU, in the
sense that the operand specifiers are decoded 
sequentially. When the Ibuffer contains 
prefetched instructions, byte zero contains the 
opcode of the current instruction, and byte one 
the first byte of the specifier currently being 
decoded. An instruction is decoded by looking 
up information from a decoding RAM (DRAM), 
which is organized as an array of 512 blocks, 
each of which has eight entries. Each entry is 
addressed by its block and entry index. The 
opcode byte plus an extended opcode, if there 
is one, will address the block. The execution 
point counter (EPC), which is a pointer indi- 
cating the position of the current specifier in 
the instruction, will select the particular entry. 
The output of the DRAM consists of information 
specifying the data context (byte, word, 
longword, etc.), data type (address, integer and 
different floating point formats), and accessing 
mode (such as read, write or modify) for each 
specifier. It also provides the Efork dispatch 
address to the E Box. 
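
The two-level addressing of the DRAM can be sketched in Python as
follows. The array dimensions come from the text; the function name
and the way an extended opcode selects a block (a single flag that
picks the upper 256 blocks) are assumptions made only for
illustration, since the exact encoding is not given here.

```python
BLOCKS = 512             # number of DRAM blocks, from the text
ENTRIES_PER_BLOCK = 8    # entries in each block, from the text


def dram_index(opcode: int, epc: int, extended: bool = False) -> int:
    """Return a flat DRAM entry number for one specifier decode.

    opcode   -- the opcode byte (the second byte for a two-byte opcode)
    epc      -- execution point counter: position of the current specifier
    extended -- True for a two-byte (extended) opcode; ASSUMED here to
                select the upper half of the 512 blocks
    """
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in one byte")
    if not 0 <= epc < ENTRIES_PER_BLOCK:
        raise ValueError("EPC must select one of the eight entries")
    block = (256 if extended else 0) + opcode
    return block * ENTRIES_PER_BLOCK + epc


MOVL = 0xD0  # VAX MOVL opcode
print(dram_index(MOVL, epc=0))   # entry for the first specifier
print(dram_index(MOVL, epc=1))   # entry for the second specifier
```

Incrementing the EPC after each specifier simply steps through the
consecutive entries of the block chosen by the opcode.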

After each specifier decode, the Ibuffer shifts 
out the consumed specifier and shifts the next 
specifier into the decoding position. The 
decode stage also increments the EPC so that
the new decode points to the next DRAM entry.
The output of the DRAM plus data extracted 
from the specifier field in the Ibuffer, such as 
GPR information and literal value, is buffered 
for the OAU stage. 

Using the specifier byte during decoding, a 
dispatch generation mechanism creates a dis- 
patch address, called Ifork, for the ACU 
micromachine. This process will continue until 
the last specifier of the instruction is decoded 
and consumed. (A bit in the DRAM output will 
indicate such an occurrence.) When this hap- 
pens, the Ibuffer shifts out byte zero and the 
last specifier, thus allowing a new instruction 
to be shifted in. 

To clarify the concepts above, note that an 
Efork dispatch is generated from the opcode. 
The dispatch is then given to the E Box to point 
to the E Box microflow that implements that 
instruction's algorithm. A similar mechanism is 
used to process specifiers. An Ifork dispatch is 
generated from each specifier and is given to 
the ACU micromachine to point to the ACU 
microflow that implements that specifier's 
algorithm. 
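
The sequential decode loop can be illustrated with a small Python sketch. It is a simplified model, not the 8600 logic: the specifier lengths follow the standard VAX addressing-mode encoding, index mode and immediate data are not modelled, and `SPECIFIER_COUNT` is a hypothetical stand-in for the DRAM contents.

```python
# Specifier length in bytes, keyed by the mode nibble (high four bits).
# Index mode (4) and immediate data following the specifier are not modelled.
SPECIFIER_LENGTH = {
    0x5: 1, 0x6: 1, 0x7: 1, 0x8: 1, 0x9: 1,   # register, deferred, autoinc/autodec
    0xA: 2, 0xB: 2,                           # byte displacement
    0xC: 3, 0xD: 3,                           # word displacement
    0xE: 5, 0xF: 5,                           # longword displacement
}

# Hypothetical stand-in for the DRAM: number of specifiers per opcode.
SPECIFIER_COUNT = {0xD0: 2}   # MOVL has two specifiers


def decode_instruction(ibuffer: bytes):
    """Return the decode steps for the instruction at byte zero, plus leftover bytes."""
    opcode = ibuffer[0]
    count = SPECIFIER_COUNT[opcode]
    position = 1                      # byte one: first byte of the current specifier
    steps = []
    for epc in range(count):          # the EPC selects the entry for each specifier
        mode = ibuffer[position] >> 4
        length = 1 if mode < 4 else SPECIFIER_LENGTH[mode]   # modes 0-3: short literal
        specifier = ibuffer[position:position + length]
        is_last = epc == count - 1    # stands in for the "last specifier" DRAM bit
        steps.append((epc, specifier.hex(), is_last))
        position += length            # the Ibuffer shifts out the consumed specifier
    return steps, ibuffer[position:]  # opcode and last specifier are shifted out
```

For the byte stream `d0 a1 04 52` (MOVL 4(R1),R2) the sketch yields one step for the displacement specifier and one for the register specifier, the second flagged as last.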

The ACU Micromachine 
With reference to the simplified pipeline 
model (Figure 4c), the ACU, Opfetch, and 
Mem-write elements are described here 
together. In this way, their functionality and 
synchronization mechanisms can be better 
understood. The Ifork saved in the decode stage 
provides the entry to the proper microsequence 
routine in the ACU micromachine. Using the 
buffered DRAM and specifier data, the ACU 
micromachine performs the necessary compu- 
tations to calculate the effective virtual address, 
and to initiate operand reads from memory or 
from the GPRs, if necessary. A copy of the 
GPRs, which is also called a GPR file, is main- 
tained in the I Box so that register access can be 
done locally, which is faster. This also allows 
simultaneous register accesses (reads) by the 
I Box, E Box, and F Box. 

For an operand that comes from a register 
source, data read from the GPR file, after 
passing through the ACU adder, will be loaded 
into the I Box data (ID) register. Immediate 
data, which comes from the Dmux buffer in 
the decode stage, takes a similar route through 
the unpack logic to the same ID register. 
The operand data is then ready for the E Box 
via the OP bus. The unpack logic is used to 
convert fixed point short literals to a floating 
point format. 
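
As an illustration of what the unpack logic has to compute, the following Python sketch expands a 6-bit short literal into its floating point value. It follows the VAX short literal encoding as we understand it (three exponent bits above three fraction bits, with an implied leading fraction bit); the function name is hypothetical and the sketch says nothing about how the hardware performs the conversion.

```python
def short_literal_to_float(literal: int) -> float:
    """Value of a 6-bit VAX short literal when used as a floating point operand."""
    if not 0 <= literal <= 0x3F:
        raise ValueError("short literals are 6 bits wide")
    exponent = literal >> 3      # bits 5:3
    fraction = literal & 0x7     # bits 2:0, below the implied leading 1
    return (8 + fraction) / 16 * 2 ** exponent
```

Under this encoding the 64 literals cover values from 0.5 to 120.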

For an operand fetch from memory, the ACU 
micromachine loads the operand effective vir- 
tual address from the adder into the VA register 
and issues an operand fetch request through the 
IVA bus. The IMD register holds any operand 
data returned from the M Box before forwarding 
it to the E Box through the OP bus. If the 
addressing mode is indirect (e.g., autoincre- 
ment deferred), the returned data in the IMD 
register will be the final virtual address of the 
operand. Then, the ACU micromachine loads 
the IVA bus with the IMD register data and 
issues another operand fetch request. The E Box 
memory data (EMD) register serves a similar 
function, but holds memory data returned as a 
result of E Box requests. Placing the EMD regis- 
ter physically in the I Box eliminates the need 
for the E Box to interface with the MD bus 
directly. 
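
The two-step fetch for an indirect addressing mode can be sketched as follows. This is only an illustration: memory is modelled as a Python dictionary, and the function name and return shape are assumptions.

```python
def fetch_operand(memory, address, deferred):
    """Return (operand, list of addresses placed on the IVA bus)."""
    requests = [address]          # first operand fetch request
    imd = memory[address]         # data returned by the M Box lands in IMD
    if deferred:
        # For an indirect mode, IMD holds the operand's final virtual address,
        # so a second request is issued with that address.
        requests.append(imd)
        imd = memory[imd]
    return imd, requests
```

A direct fetch costs one request; a deferred fetch costs two, which is one reason the logical cycle of the OAU stage can span several physical cycles.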

The ACU microsequences for many simple 
and frequently used specifiers take one cycle, 
so that one specifier can potentially be 
processed in each cycle. Some examples of 
such specifiers are (a) the register mode, Rn; 
(b) the register deferred mode, (Rn); and 
(c) byte, word, and longword displacement 
modes, B^D(Rn), W^D(Rn), and L^D(Rn), 
respectively. The successful processing of an 
operand specifier in the OAU stage also loads 
the earlier buffered Efork into a register accessi- 
ble by the E Box. 

The logical cycle of the OAU stage may take 
many physical cycles. This may happen if the 
algorithm that implements the addressing mode 
is a complex one, or if the operand fetch is 
from memory and it results in a cache miss. In 
this case the execution stage may have already 
started executing the Efork microsequence, 
thus attempting to read and use the source 
operand, which is not yet available. To resolve 
this, the OAU stage provides additional operand 
data-valid flags. 

The ACU micromachine also issues the actual 
operand write request for most instructions. In 
this case the micromachine saves the calculated 
destination address and waits until operand 
results are ready from the E Box. When the 
results are ready, the E Box will write them, via 
the W bus, into a register called the write latch, 
internal to the I Box. The E Box also releases 
the ACU micromachine to issue the appropriate 
operand memory write request. 



Multiple Program Counters 
The VAX 8600 CPU maintains a number of pro- 
gram counters for each of the instructions 
under execution in the pipeline. This is neces- 
sary so that instruction restart is possible after 
an exception service sequence. The program 
counters consist of the following items: 

■ Program counter (PC), which points to the 
opcode, operand specifier, and immediate 
data or addresses as they are decoded. 

■ Current program counter (CPC), which 
points to the instruction to be executed next 
in the OAU stage. Normally, this is the 
instruction currently being decoded. 

■ I Box starting address (ISA), which points to 
the instruction being executed in the OAU 
stage. 

■ E Box starting address (ESA), which points to 
the current instruction being executed in the 
E Box and F Box. 

The prefetcher maintains its own instruction 
stream address pointer, the VIBA register, for 
requests to fill the Ibuffer. 

The updating of the CPC, ISA, and ESA hap- 
pens when an instruction enters the decode, 
OAU, and execution stages respectively. In gen- 
eral, the CPC will be loaded with the address of 
the beginning of the instruction to be decoded. 
The ISA will be loaded with the CPC when the 
OAU has started processing that instruction. 
Similarly, the ESA will be loaded with the ISA 
when the E Box begins to execute that same 
instruction. 
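
The hand-over of addresses between the three counters can be sketched as below. The class and method names are hypothetical; only the update order (CPC at decode, ISA at OAU entry, ESA at execution entry) comes from the text.

```python
class ProgramCounters:
    """Tracks the start address of one instruction as it moves down the pipeline."""

    def __init__(self):
        self.cpc = None   # instruction being decoded
        self.isa = None   # instruction in the OAU stage
        self.esa = None   # instruction in the E Box and F Box

    def enter_decode(self, address):
        self.cpc = address

    def enter_oau(self):
        self.isa = self.cpc   # ISA is loaded from the CPC

    def enter_execute(self):
        self.esa = self.isa   # ESA is loaded from the ISA
```

Because each counter is copied forward only when its stage accepts the instruction, every stage always holds the restart address of the instruction it is working on.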

Instruction Backup and Unwinding 
In the VAX architecture, an exception may 
occur during the execution of an instruction. 
An example of an exception is a page fault on a 
memory read. For most instructions the VAX 
architecture requires that the program state be 
restored to what it was prior to the execution of 
the instruction so that, after the exception ser- 
vice sequence, the same instruction can be 
restarted. For some types of instructions, such 
as the string instructions, total program state 
restoration is impossible. In those cases, how- 
ever, enough of the state is saved and restored 
so that the instruction can continue its execu- 
tion from where it was interrupted. 

In the VAX 8600 CPU, the parts of the pro- 
gram state that must be restored consist of those 
GPRs that have been modified during address 
calculation, and the instruction starting 
address. Some addressing modes, such as 
autoincrement and autodecrement, will modify 
the GPRs; such modifications are kept in the 
Rlog. During instruction unwinding (also 
called instruction backup), the ACU 
micromachine will restore the affected GPRs 
from the Rlog. Since a number of instructions 
can reside in different stages of the pipeline 
simultaneously, the Rlog has enough entries to 
allow register restoration for multiple instruc- 
tions. The PC for the instruction in question 
will also be restored from either the CPC, ISA, or 
ESA, depending on the state of the pipeline 
stages. This restoration mechanism is also used 
to handle interrupts. 
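
A register log of this kind can be sketched in Python as follows. The entry format is an assumption (the text does not say whether the Rlog records old values or the amount of change; this sketch records the amount), and instruction tags are assumed to increase in program order.

```python
class Rlog:
    """Log of GPR side effects made during address calculation (illustrative)."""

    def __init__(self, gprs):
        self.gprs = gprs
        self.entries = []     # (instruction tag, register number, delta)

    def modify(self, tag, reg, delta):
        """Apply an autoincrement or autodecrement and remember it."""
        self.gprs[reg] += delta
        self.entries.append((tag, reg, delta))

    def retire(self, tag):
        """Forget the entries of an instruction that completed normally."""
        self.entries = [e for e in self.entries if e[0] != tag]

    def unwind(self, tag):
        """Undo the given instruction and every younger one, newest first."""
        while self.entries and self.entries[-1][0] >= tag:
            _, reg, delta = self.entries.pop()
            self.gprs[reg] -= delta
```

Undoing entries newest first matters when two instructions in the pipeline have modified the same register.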

Branch Instruction Processing 
For most branch instructions, the I Box also 
calculates their target addresses and performs 
the branch decisions. These instructions 
include conditional (e.g., BEQL and BNEQ) and 
unconditional (e.g., BRB) branches, as well as 
computed branches (e.g., ACBL and SOBGTR). 
Such decisions are made by looking at the 
appropriate condition code bits that result from 
an execution prior to the branch. The branch 
prediction scheme used here is biased towards 
branch taken, which is based on measured fre- 
quency of branching data. Figures 6a and 6b 
show an example of the microinstruction 
sequence for a branch instruction. 



During a conditional branch, the ACU 
micromachine holds the branch target address 
in the VA register and will attempt to initiate an 
instruction fetch from that address before it can 
make the branch decision. A condition code 
synchronization signal (CCSYNC) from the 
E Box signifies that the condition code will be 
ready in the next physical cycle. In cycle 3, 
when a CCSYNC is received, the ACU 
micromachine will issue the first request of the 
branch target instruction stream. In the next 
cycle, when the ACU receives the condition 
codes, it will use them to decide whether or 
not the branch is to be taken. If the branch is 
not to be taken, the decision will not be known 
early enough to inhibit the instruction fetch 
issued in cycle 3, due to signal delay. In that 
case a correction must be performed in cycle 4. 

A branch-taken decision (Figure 6a) means 
that the instruction prefetch request was cor- 
rect, and additional requests can be issued. The 
I Box then flushes the prefetcher and decode 
stages, which still hold the old instruction data, 
and allows the new instruction stream to be 
loaded and decoded. 

A branch-not-taken decision (Figure 6b), on 
the other hand, causes an abort of the prefetch 
request initiated earlier in cycle 3 from the tar- 
get address, thereby allowing the prefetcher 
and decode stages to resume the processing of 
the current instruction stream. There is no pen- 
alty for branch-not-taken here if the current 
instruction stream is already in the Ibuffer; the 
cost of starting a new instruction stream is thus 
kept at a minimum. This scheme gives a simple, 
yet effective, mechanism to handle branches. 
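
The two outcomes of a conditional branch can be summarized in a short Python sketch. The cycle numbers follow the description above; the function name, the action strings, and the return shape are purely illustrative.

```python
def resolve_branch(taken, target, sequential_pc):
    """Return (next instruction stream address, actions by cycle)."""
    # Cycle 3: CCSYNC arrives, so the target stream is requested speculatively.
    actions = [(3, f"prefetch from target {target:#x}")]
    if taken:
        # Cycle 4: the speculative request was correct; discard the old stream.
        actions.append((4, "flush prefetcher and decode stages"))
        return target, actions
    # Cycle 4: the request was wrong; abort it and keep the current stream.
    actions.append((4, "abort target prefetch"))
    return sequential_pc, actions
```

Either way the target fetch is issued before the decision is known, which is what biases the scheme towards the branch-taken case.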

The E Box is responsible for handling the 
remaining types of branches and other instruc- 
tions that can alter the instruction flow. This 
includes CASE instructions, subroutine calls, 
and returns. The mechanism used is the same as 
that described for cold starts in the Instruction 
Prefetch section. 

Data Dependency Resolution 
The use of pipelining in the I Box requires 
additional mechanisms to resolve data depen- 
dency among instructions. Data dependency 
can happen in many situations; two key exam- 
ples are the following: 

■ Register conflicts, which happen when a 
source operand uses a register that is also the 
destination register of the previous instruc- 
tion. For example, in 

MOVL R0,R1 

MOVL (R1),R2 

the sourcing of R1 by the ACU in the second 
instruction must be inhibited until the first 
instruction is completed in the E Box. 



■ Memory conflicts if an out-of-order memory 
access is allowed. For example, in 

MOVL R0,(R1) 

MOVL (R2),R3 

if R1 equals R2, then the operand read for 
the second instruction must be postponed 
until the write in the first instruction is 
issued. This also mandates that additional 
collision-detection logic exists. 

The VAX 8600 I Box uses a register 
scoreboard and a single operand port to resolve 
both types of conflicts. The scoreboard pro- 
vides a simple reservation-table mechanism to 
accomplish this resolution. The ACU will enter 
the GPR number to the scoreboard for every 
register destination specifier the ACU 
processes. For every subsequent ACU sourcing 
from a GPR, the scoreboard is checked to detect 
any conflict. If such a conflict exists, the sourc- 
ing operation is temporarily inhibited via a 
scoreboard stall. A write to the GPR by the 
E Box will remove that GPR from the 
scoreboard, thus allowing the previously 
stalled sourcing operation to resume. In the 
VAX 8600 CPU, the scoreboard can be looked 
upon as a two-entry associative memory 
structure. 

Figure 7 shows an example of the functions 
of the scoreboard for the instruction sequence 
discussed in the first example above. The func- 
tions performed in each cycle are described 
below. 

Cycle 1 The ACU is processing the MOVL 
R0,R1 instruction. The scoreboard at 
this time is assumed to be empty. 
The ACU reads R0 and loads the ID 
register. The cycle is completed 
without problems. 

Cycle 2 The scoreboard is loaded with R1 as 
a result of the previous cycle. Since 
cycle 2 requires using R1 as the 
address source, the I Box control dis- 
covers that there is a scoreboard 
"hit" on R1 and the ACU 
micromachine stalls. It will subse- 
quently attempt to execute the same 
microinstruction during the next 
cycle. 

Cycle 3 The E Box can now execute the first 
MOVL instruction, but the result will 
not be available until the beginning 
of cycle 4. As in cycle 2, the ACU 
micromachine still stalls in cycle 3. 

Cycle 4 The execution of the first MOVL 
instruction in the previous cycle by 
the E Box causes R1 to be drained 
from the scoreboard. The ACU can 
now continue and finishes the sec- 
ond MOVL instruction. 

Cycle 5 The scoreboard is loaded with R2. As 
in the earlier stalled cycles, the ACU 
micromachine will not be able to 
complete the next MOVL if the next 
instruction uses R2 in operand evalu- 
ation. In that case the ACU 
micromachine will stall until a write 
to R2 is completed. 
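
The cycle-by-cycle behavior above can be replayed with a small Python model of the scoreboard. This is a sketch under stated assumptions: the two-entry capacity and the cycle in which the E Box write becomes visible come from the text, while the class, the method names, and the trace format are hypothetical.

```python
class Scoreboard:
    """Reservation table of GPR numbers whose writes are still pending."""

    def __init__(self, capacity=2):           # two entries, per the text
        self.capacity = capacity
        self.pending = set()

    def reserve(self, reg):
        if reg not in self.pending and len(self.pending) >= self.capacity:
            raise RuntimeError("scoreboard full")
        self.pending.add(reg)

    def hit(self, reg):
        return reg in self.pending

    def release(self, reg):
        self.pending.discard(reg)


def replay_figure_7():
    """Trace MOVL R0,R1 followed by MOVL (R1),R2 through the ACU."""
    sb = Scoreboard()
    trace = []
    # Cycle 1: the first MOVL sources R0; nothing is pending yet.
    trace.append((1, sorted(sb.pending), "ACU reads R0"))
    sb.reserve(1)                             # R1 is its destination
    write_visible_in = 4                      # E Box result is available in cycle 4
    cycle = 2
    while True:
        if cycle == write_visible_in:
            sb.release(1)                     # the E Box write drains R1
        if not sb.hit(1):
            break
        trace.append((cycle, sorted(sb.pending), "ACU stalls on R1"))
        cycle += 1
    trace.append((cycle, sorted(sb.pending), "ACU reads (R1)"))
    sb.reserve(2)                             # R2 is the second destination
    trace.append((cycle + 1, sorted(sb.pending), "R2 pending"))
    return trace
```

The trace shows the two stall cycles (2 and 3), the resumed read in cycle 4, and R2 entering the scoreboard in cycle 5.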

Memory conflicts will not happen in the 
VAX 8600 CPU because the ACU micromachine 
controls both the operand read and write for 
most instructions via the operand port. The 
micromachine is sequenced in such a way that 
out-of-order memory access from the I Box is 
impossible. 

Certain instructions whose operand addresses 
may not be known at the time of decoding (e.g., 
bit field instructions) will be handled by the 
E Box. Operand fetching is done directly by the 
E Box via the E Box port (see Figure 2). In those 
instructions the I Box suspends itself after the 
completion of the address calculation for all 
specifiers. Any new operand fetch requests 
from the operand port will be inhibited by an 
I Box suspension. This prevents a potential 
memory conflict from occurring when the 
I Box attempts to read operands for the next 
instruction while the current operand result 
has yet to be written by the E Box. 

Instruction Optimizations 
The I Box generates a number of optimizations 
that improve the performance of the CPU. For 
instructions using a GPR as the results destina- 
tion, the decode stage will also consume the 
GPR specifier during the decoding of the speci- 
fier immediately preceding it and present a sin- 
gle dispatch address to the execution stage. In 
addition to the source operand, the I Box also 
supplies the destination GPR address to the 
E Box. The E Box will use that address to access 
its local GPR file. This optimization removes 
one dispatch from the flow to the E Box. 

Another form of optimization eliminates 
scoreboard stalls when the source operand is in 
the same GPR to be updated in the future by the 
previous instruction. In this case the ACU will 
ignore the scoreboard stalling situation and will 
signal this fact by presenting a modified dis- 
patch address to the E Box. The E Box will sub- 
sequently access the correct updated GPR value 
from its own local copy. 

Pipeline Stage Synchronization 
As described earlier in the section on the 
VAX 8600 pipeline, interstage communication 
in the VAX 8600 CPU is done through a number 
of drain signals, as well as a few global flags. 
Each stage of the pipeline sets the valid flag of 
its output buffer to full when data is ready. The 
drain signal indicates to the stage that the 
buffer is going to be consumed by the successor 
stage. This will make the valid flag "empty." 
The global flags are generally broadcast to most 
other stages. This interlock mechanism pro- 
vides the basis for the synchronization among 
the pipeline stages. 

Since each stage of the pipeline may take a 
varying number of physical cycles to complete, 
there are, at times, empty or full conditions in 
any of the pipeline stages. An empty condition 
occurs in a pipeline stage when it wants to 
drain its input buffer while it is empty. This 
condition will cause an input stall or a 
micromachine idling for lack of dispatch 
addresses. A full condition occurs in a pipeline 
stage when it wants to load its output buffer 
while that buffer is full. This condition will 
cause an output stall. Other reasons, such as 
resource contention, will also cause idling or 
stalling. 
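
The valid-flag and drain handshake between two adjacent stages can be sketched as a one-entry buffer. The class and method names are hypothetical, and the global flags are not modelled.

```python
class StageBuffer:
    """One-entry output buffer guarded by a valid flag."""

    def __init__(self):
        self.valid = False    # "empty"
        self.data = None

    def load(self, item):
        """Producer side. Returns False on an output stall (buffer still full)."""
        if self.valid:
            return False
        self.data, self.valid = item, True
        return True

    def drain(self):
        """Consumer side. Returns None on an input stall (buffer empty)."""
        if not self.valid:
            return None
        self.valid = False    # the drain signal makes the flag "empty" again
        return self.data
```

A producer that sees `load` fail simply retries in a later cycle, which is the local form of control the pipeline model relies on.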

Each stage uses a different scheme to handle 
such conditions. In both the prefetcher and 
decode stages, internal flags are maintained to 
indicate empty or full conditions. The 
prefetcher stage keeps track of the number of 
valid bytes in the Ibuffer and initiates a new 
prefetch, if necessary. Data removed from the 
Ibuffer by the decode stage will decrease the 
number of valid bytes, whereas new prefetched 
data will increase the number. When the Ibuf- 
fer is full, the prefetcher stage will have an 
output stall (i.e., no new prefetch requests will 
be issued). The decode stage loads the output 
buffer valid flag after each decode. It will 
assume an output stall if the buffer is not being 
drained by the ACU element. That element, in 
turn, can drain such a buffer during its execu- 
tion and clear the valid flag, thereby allowing 
decoding to be resumed. 

The ACU micromachine contains the most 
complicated stalling and idling mechanisms in 
the entire CPU. Most resource contention and 
dependency conflicts, as well as full and empty 
conditions, can occur in that micromachine. 

There are three types of stalling and idling in 
the ACU micromachine. 

1. Resource contention and busy, and 
dependency conflict stalls. Resource 
contention examples are (a) the simulta- 
neous update of a GPR by the OAU and 
execution stages, and (b) the use of 
certain buses by two resources at the 
same time. This is best exemplified by 
the register-dependency conflict detec- 
tion in the scoreboard. Another form of 
this kind of stall can result from memory 
requests not being accepted due to the 
M Box being busy (that is, while it is 
servicing previous requests). A full con- 
dition, which prevents any further pro- 
gress of execution, is another example. 
In general, for this type of stall, the 
micromachine will suspend the execu- 
tion of the current instruction and 
resume it when the stall condition has 
been removed. 

2. Idling and nops. Empty conditions hap- 
pen in the ACU, for example, when the 
instruction decoder cannot provide a dis- 
patch address due to insufficient valid 
bytes in the Ibuffer. Another nop condi- 
tion is microtraps, which can be caused 
by unaligned data references or by the 
flushing of the pipeline. In both cases 
the micromachine will execute the 
instruction, in the sense that a new micro 
program counter will be loaded, but 
none of the pertinent machine state will 
be modified. In the next cycle the 
micromachine will normally execute a 
new instruction generated through 
microtraps or the availability of the next 
dispatch address. 

3. Special stalls. In certain cases in which 
the purpose of the execution is only to 
supply dispatches to the E Box, the 
micromachine will stall to prevent an 
undesired modification of the state. Part 
of the state, such as Efork loading, is still 
allowed. This kind of stall occurs most 
often for single-byte instructions without 
any specifiers. In this case a superfluous 
dispatch address to the ACU 
micromachine is generated from the 
specifier field in the Ibuffer, but that 
address must not be executed lest it 
modify the state unintentionally. How- 
ever, the dispatch to the E Box must still 
be loaded and the appropriate program 
counter updated. 

An Example 

This section gives an example that provides 
a more global view of how a piece of code 
executes in the VAX 8600 pipeline. The 
program segment, shown in the box in 
Figure 8, employs two key mechanisms of the 
design: a branch and an I Box-write. The 
purpose of this example is to show the follow- 
ing aspects: 

■ The flow of many instructions through the 
pipeline, including their uses of the stages, 
elements and resources 

■ The state of the pipeline in any given physi- 
cal cycle, in order to understand the interac- 
tion among the various instructions active in 
the pipeline 

Figure 8 shows how simple instructions, such 
as the first three in the example, flow through 
the pipeline in a straightforward way, using 
only one physical cycle per element. All pipe- 
line elements are then kept busy constantly, 
thus achieving the VAX 8600 CPU peak 
throughput of 12.5 MIPS, corresponding to the 
pipeline executing one macroinstruction per 
physical cycle. Notice that in this case the 
results are written to the GPRs, so that the 
Mem-write element is not utilized. Also, simple 
memory reads do not stall the pipeline but are 
performed in only one cycle in the Opfetch 
element. Moreover, the ACU micromachine 
immediately starts processing the next specifier 
after having issued a memory read request; 
related memory problems, if any exist, will be 
handled by the E Box. 
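
The peak figure follows directly from the cycle time. A small worked calculation, assuming the 80-nanosecond physical cycle quoted in the paper's summary (the function name is illustrative):

```python
CYCLE_NS = 80  # physical cycle time in nanoseconds, as quoted in the summary


def peak_mips(cycle_ns, instructions_per_cycle=1.0):
    """Millions of instructions per second for a given cycle time."""
    return instructions_per_cycle * 1000.0 / cycle_ns
```

One instruction every 80 nanoseconds gives 1000 / 80 = 12.5 million instructions per second; any stall lowers the average number of instructions per cycle and hence the delivered rate.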

The branch instruction that follows in the 
example is one in which the branch is taken. It 
is therefore processed according to the mecha- 
nism described in the Branch Instruction 
Processing section and in Figure 6a. At the 
beginning of cycle 8, the CMPL instruction in 
the E Box sends a CCSYNC to the ACU element, 
which in turn issues an Ibuffer request (IBF in 
Figure 8) from the branch target address (TA in 
Figure 8). This request will result in the fetch- 
ing of the INCL instruction by the Ifetch ele- 
ment in cycle 9. Also in cycle 9, the condition 
codes (CC in Figure 8) computed by the CMPL 
instruction arrive at the ACU element, where 
they determine that the branch is to be taken. 
The ACU element then issues a "flush" com- 
mand to the prefetcher and decode stages to 
make room for the new instruction stream. 
Notice that instruction execution will resume 
in the E Box only four physical cycles after the 
branch. This is a relatively small penalty for a 
branch, given that the pipeline latency is nor- 
mally six physical cycles. 

The INCL instruction that was prefetched by 
the branch mechanism arrives in cycle 11 at the 
ACU element, where the operand effective 
address is loaded in the VA register. In the same 
cycle a memory-read request is issued and the 
operand address is kept in the VA register until 
the E Box is ready to do the write. The operand 
is fetched in cycle 12 and passed to the E Box in 
cycle 13. Then the E Box performs the incre- 
ment function, sends the result to the Mem- 
write element (into the write latch) and issues 
an I Box-write command (IBWRITE in Figure 8) 
to the ACU micromachine. This in turn issues 
the memory-write request to the M Box via the 
operand port (see Figure 2). The E Box waits 
two extra cycles after having issued the I Box- 
write command in order to handle potential 
memory problems, such as a page fault, before 
the ESA register is overwritten by retiring the 
instruction. Execution of the remaining instruc- 
tion stream resumes normally in cycle 16. 

Summary 

In this paper, a simplified model of pipeline 
implementations was introduced. In this 
model, a pipeline was described as a sequence 
of stages connected by a transport mechanism, 
which moves an item from the output buffer of 
a stage to its successor (i.e., a partial ordering). 
In connection with this model, the issues cru- 
cial to designing a pipeline were discussed, 
specifically in reference to the implementation 
of the VAX 8600 CPU and its instruction and 
operand fetch unit, the I Box. The most impor- 
tant of such issues are as follows: 

1. The hand-off of items from one stage to 
the next (the issue of local versus global 
control) 

2. Buffering, which relates to the number 
of items within a stage 

3. The contention for resources and associ- 
ated stall conditions 

4. The dependency of one stage on the 
activity of another stage (e.g., forward 
and backward dependencies) 

The significance of this implementation, and 
of the design presented here, lies in the suc- 
cessful resolution of the complex design 
problems that occur in the pipelined imple- 
mentation of modern architectures, such as the 
VAX architecture. Specifically, the use of a reg- 
ister scoreboard to prevent the use of stale reg- 
ister data, a facility to recover in the presence 
of exceptions, and synchronization mechanisms 
to deal with VAX-architecture specifics, such as 
unaligned references, can be considered a 
major accomplishment. The capabilities of this 
design (a fourfold speed improvement over the 
VAX-11/780 CPU and, under favorable condi- 
tions, the ability of the I Box to deliver one 
instruction every 80 nanoseconds to the E Box, 
which means a peak execution rate of 
12.5 MIPS) certainly make the VAX 8600 sys- 
tem a major engineering achievement. 



The F Box, Floating Point in the VAX 8600 System 

The VAX 8600 system contains a processor, the F Box, that performs fast, 
accurate floating point calculations. The F Box logic design and algo- 
rithms are more efficient than those in the VAX-11/780 system, a fact 
that greatly improves the performance of the 8600. The F Box has adder 
and multiplier modules that use macrocell array technology to perform 
the arithmetic calculations and polynomial evaluations. Logic control is 
achieved with microcode, which decreases the hardware required. Some 
interesting tradeoffs were made, especially to merge the microcode into 
the macrocell arrays. The resulting F Box design is a very reliable hard- 
ware and software package. 



One of our key design objectives for the 
VAX 8600 processor was to continue the domi- 
nant position of the VAX Family in the scientific 
computing market. That objective required 
the development of a floating point subsystem 
that met user demands for increased perform- 
ance and reliability. This paper describes how 
we achieved that objective in the 
VAX 8600 floating point accelerator (FPA) and 
the considerations that went into its design. We 
believe that the particular floating point algo- 
rithms chosen fit nicely with the component 
technology to yield a high-performance FPA 
with a relatively low cost. 

The F Box Operations Flow 

Figure 1 shows the flow of operations in the 
VAX 8600 CPU. The F Box receives source 
operands over the operand bus (OP bus) from 
the I Box and delivers results over the write bus 
(W bus). These results are stored in memory or 
in general purpose registers (GPRs) in the 
E Box and I Box, and in the F Box itself. The 
CPU allows two source operands to be 
processed in a single cycle by passing GPR 
identifiers between boxes. Each box has its own 
copy of the contents of all GPRs. Therefore, the 
I Box needs to pass only the number of a source 
operand GPR rather than the whole operand 
itself. This passing technique speeds up the 
flow of floating point instructions through the 
pipeline. The I Box passes the opcode of the 
instruction to the F Box along with the oper- 
ands. There, the opcode is transformed by the 
F Box Dispatch RAM (FDRAM) into decoded 
information that is used by the F Box control 
logic. 

The M Box has a 16KB cache that contains 
both instructions and data. This box performs 
the translation of virtual addresses into physical 
addresses, and it connects to the input/output 
(I/O) bus and the memory arrays. The E Box 
executes non-floating point instructions and 
controls the overall operation of the system. 
The E Box assists the F Box in executing instruc- 
tions and handles any overflow and underflow 
problems. 

For more information on VAX architecture, 
see reference 1. 

VAX Floating Point Formats and 
Instructions 

The VAX architecture supports four floating 
point formats: F, D, G, and H. F and D are 
the formats from the original PDP-11 floating 
point processor (1971). These formats are 32 
and 64 bits wide respectively, and both have 8 
bits of exponent. The G and H formats were 
added later to the VAX-11 architecture. These 
formats are 64 and 128 bits wide respectively, 
the G having 11 bits of exponent and the H 
having 15. To achieve fast, efficient processing, 
fractions are always normalized, and the lead- 
ing bit (the hidden bit) is not stored. 

Figure 1 VAX 8600 Operations Flow 

F format instructions execute the fastest of all 
the floating point instructions on any VAX sys- 
tem and are used in most programs that require 
adequate precision (24 bits) and range (2^-127 to 
2^127). The D and G formats extend that preci- 
sion and range. The D format provides 56 bits 
of precision, 3 more bits than the G. Usually, 
however, the extra range in the G format (2^-1023 
to 2^1023) is more useful in performing calcula- 
tions. The D format is used in programs in 
which compatibility with earlier VAX systems 
and PDP-11 systems is important. 
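
The precision figures quoted above follow from the format widths once the hidden bit is counted. A short worked calculation in Python, assuming one sign bit per format as in the VAX architecture (the table and function names are illustrative):

```python
# (total width in bits, exponent width in bits) for each format, from the text
FORMATS = {"F": (32, 8), "D": (64, 8), "G": (64, 11), "H": (128, 15)}


def precision_bits(fmt):
    """Bits of precision: stored fraction bits plus the hidden leading bit."""
    total, exponent = FORMATS[fmt]
    stored_fraction = total - 1 - exponent    # one bit is the sign (assumption)
    return stored_fraction + 1                # the hidden bit is not stored
```

This gives 24 bits for F and 56 for D, matching the text, and 53 for G, which is the 3-bit difference between the D and G formats.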

In the 8600, the D and G formats have 
approximately the same execution time, but 
H format instructions execute more slowly than 
the others. These H format instructions are 
implemented in the FPA but are intended for 
use as a backup format for intermediate calcula- 
tions in the D and G formats. That use ensures 
that the final calculation result has sufficient 
precision and avoids overflow or underflow 
problems. 

The VAX architecture uses either 2- or 3- 
operand instructions for the basic operations of 
add, subtract, multiply, and divide. In the 
8600, the 2-operand instructions execute faster 
and are used by the compiler whenever practi- 
cal. That is certainly the case when the second 
operand is from a GPR, for then the I Box can 
optimize the passing of operands by passing the 
GPR number. 

In addition to the simpler instructions men- 
tioned above, the 8600 implements the com- 
plex EMOD and POLY instructions for argument 
reductions and series evaluations. EMOD mul- 
tiplies two operands that have extended preci- 
sion and separates the result into integer and 
fraction components. POLY takes an argument, 
a degree, and a coefficient table and performs a 
series evaluation to yield a result. (Both 
instructions are executed with extra bits of pre- 
cision.) Complex mathematical functions can 
be completed in a few steps by using these 
instructions. For instructions involving integer 
multiplications, the F Box performs the actual 
computation while the E Box handles the rest of 
the instructions. Those overlapping operations 
decrease the execution time for the MULL, 
EMUL, and INDEX functions. 
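
What the two complex instructions compute can be sketched in Python. This is an arithmetic illustration only, not the F Box algorithm: the series evaluation is written as Horner's rule with the highest-order coefficient first in the table (the VAX convention as we understand it), the extra bits of precision are not modelled, and the function names are hypothetical.

```python
import math


def poly(argument, degree, table):
    """Evaluate a polynomial of the given degree by Horner's rule.

    table[0] is assumed to be the coefficient of the highest-order term.
    """
    result = table[0]
    for coefficient in table[1:degree + 1]:
        result = result * argument + coefficient
    return result


def emod(multiplier, multiplicand):
    """Multiply, then split the product into integer and fraction parts."""
    fraction, integer = math.modf(multiplier * multiplicand)
    return int(integer), fraction
```

A typical use is argument reduction with `emod` followed by a series evaluation with `poly`, which is how a math library can complete a function such as sine in a few steps.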

For programs in F and D formats, the execu- 
tion speed of the 8600 is about four times that 
of the 11/780. For programs in G and 
H formats, the execution speed is about twelve 
times faster, since those formats are not acceler- 
ated in the 11/780. Table 1 contains the execu- 
tion times for some typical instructions. 



Table 1 Execution Times 

Instruction   Operands                   Execution Time (nanoseconds) 

ADDF2         Mem, R                     160 
MULF2         Mem, R                     320 
DIVF2         Mem, R                     1300 
POLYF         argument, degree, table    (1300 + 6*degree*80) 
ADDG2         Mem, R                     400 
MULG2         Mem, R                     800 
INDEX                                    1000 
EMULL                                    640 
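
The times in Table 1 can be restated in machine cycles. The sketch below assumes the 80-nanosecond cycle quoted in the pipeline paper's summary and reads the POLYF entry as 1300 + 6*degree*80 nanoseconds; the operator in that entry is garbled in the source, so the formula is an assumption, as are the names used here.

```python
CYCLE_NS = 80   # physical cycle time in nanoseconds (from the pipeline paper)

EXECUTION_NS = {
    "ADDF2": 160, "MULF2": 320, "DIVF2": 1300,
    "ADDG2": 400, "MULG2": 800, "INDEX": 1000, "EMULL": 640,
}


def cycles(nanoseconds):
    """Convert an execution time to (possibly fractional) 80 ns cycles."""
    return nanoseconds / CYCLE_NS


def polyf_ns(degree):
    """POLYF time under the assumed reading 1300 + 6*degree*80."""
    return 1300 + 6 * degree * 80
```

Under these assumptions ADDF2 takes 2 cycles and MULF2 takes 4, while a fifth-degree POLYF would take 3700 nanoseconds.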



Macrocell Array Technology in the 
F Box 

The component technology used in the
VAX 8600 system is the macrocell array (MCA),
which provides about one thousand gate
equivalents with a typical gate speed of one
nanosecond. MCA utilizes emitter-coupled
logic (ECL) technology in a 68-pin package
that is one inch square with a maximum power
dissipation of 5.0 watts. MCA technology is an
extension of the gate array concept; but instead
of gates, each cell in the array initially contains
a number of unconnected transistors and
resistors. By creating interconnecting patterns with
these components, a designer can transform
them into small-scale and medium-scale
integration (SSI/MSI) logic functions, or "macros."
These macros take the form of standard logic
elements such as dual D-type flipflops, dual full
adders, and quad latches. All are series-gated
ECL structures used in the 8600 to achieve
optimized performance.

The F Box has two modules, each containing
MCAs. The F Box adder module has 24 MCAs
and the F Box multiplier module has 21; in all,
the F Box contains 17 different types of MCAs.
The adder and multiplier modules are 8-layer
(6 signal layers) printed circuit boards. Six
signal layers were needed because the amount of
etched interconnect on these boards, up to
9000 inches, could not be routed on our
traditional 4-layer boards. The interconnect is
maintained at a controlled (transmission line)
impedance to guarantee signal integrity. We found
that the lowest failure rates are obtained when
the integrated circuit components on the
boards are cooled in a uniform manner. To
achieve that cooling, we used wind tunnel
techniques to develop algorithms that showed
the optimum placement of those components.
Moreover, for each module design, we ran
computer programs to analyze the thermal profiles
of the integrated circuits. These techniques
allowed us to determine the best component
placements to ensure the highest reliability.

An integral part of the module design is a
multivoltage bus bar that distributes power and
also acts as a stiffener to maintain board
flatness. On its edge, each module has 282 pins
that can connect it to a 16-layer backplane.

The connections from the F Box to the rest of 
the CPU had to be minimized in order to reduce 
the loading and propagation delays on the sig- 
nal lines. Therefore, only the adder module and 
the GPRs have interfaces to the W bus and 
OP bus. The adder module handles exponent 
calculation, normalization, rounding, and pack- 
ing of results. Since only the adder module con- 
nects to the CPU, the multiplier module must 
receive all of its operands from the adder. To 
increase the speed, we chose algorithms to min- 
imize the signal crossings between modules and 
between MCAs within a module; for example, 
addition calculations are done entirely within 
the adder module while multiplication calcula- 
tions stay within the multiplier module. The 
physical partitioning within each module 
required us to slice the various functions into
"pieces" that fit into one MCA. To minimize
the number of operational shifts involved, the 
MCAs on the adder module were partitioned by 
functions, or horizontally. The MCAs on the
multiplier module were sliced by data, or verti- 
cally. Figure 2 illustrates the physical partition- 
ing of the macrocell arrays in the F Box, as well 
as the MCAs on the adder and multiplier 
modules.

Figure 2 F Box Physical Partitioning



Each module is controlled by its own
microcode, which is stored in 256 X 4 RAM 
components with access times of 7 nano- 
seconds. The RAM outputs are wired together in 
pairs to give 512 memory locations. Each
microword in the adder module has 48 bits, 
while each microword in the multiplier has 40. 

The 8600 has an 80-nanosecond cycle time, 
and each cycle has four subphases: T0, T1, T2,
and T3. The F Box cycle time is half as long, 
40 nanoseconds, and each cycle has just two 
subphases: T02 and T13. The storage elements
in the F Box are level-sensitive latches with the 
clock pulses set as wide as possible without 
overlap. That technique yields a lot of flexibil- 
ity in the placement of the latches without 
slowing the data flow. Thus we got a simple and 
reliable clock system by having consecutive 
latches clocked with alternating clocks. Each
MCA needs only two pins for clock signals; 
thus, more of the available pins can be used for 
data signals. 
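
The relationship between the two clocks can be
checked against the figures in Table 1. The
following minimal sketch converts an execution
time into CPU cycles and F Box cycles; the
function and constant names are illustrative and
are not taken from the 8600 design.

```python
CPU_CYCLE_NS = 80    # 8600 cycle time quoted in the text
FBOX_CYCLE_NS = 40   # F Box cycle time, half the CPU cycle


def cycles(execution_time_ns):
    """Return (CPU cycles, F Box cycles) for a time in nanoseconds."""
    return (execution_time_ns / CPU_CYCLE_NS,
            execution_time_ns / FBOX_CYCLE_NS)


# Execution times taken from Table 1.
for name, ns in [("ADDF2", 160), ("MULF2", 320), ("ADDG2", 400)]:
    cpu, fbox = cycles(ns)
    print(f"{name}: {cpu:g} CPU cycles, {fbox:g} F Box cycles")
```

Under this reading, ADDF2 at 160 nanoseconds
corresponds to two CPU cycles (four F Box cycles)
and ADDG2 at 400 nanoseconds to five CPU cycles
(ten F Box cycles).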

Arithmetic Algorithm Processing 

Addition and Subtraction Operations 
During an addition operation, groups of 32 bits 
come to the F Box from either the OP bus, the 
W bus, or a GPR and go into the fraction oper- 
and select logic (SOP). Each group also goes to 
the exponent processor (GXP, FXP) and the 
sign processor (GXP). Figure 3 depicts the
MCA component connections on the adder and
multiplier modules. 

The exponent processor calculates the expo- 
nent difference of the source operands to deter- 
mine which is the larger; the absolute value of 
that difference is used to align the fraction of 
the smaller operand. The alignment and 
unpacking of each number are combined into 
one shift by including the unpacking constant 
in the exponent calculation. The alignment 
count is passed on to the fraction adder (FAD) 
MCAs. The larger of the two exponents is kept 
by the exponent processor to complete the 
exponent calculation. 

In turn, the fraction bits are steered through 
the SOPs to the alignment circuits in the frac- 
tion alignment (ALN) MCAs. Here the hidden 
bit is restored, the exponent bits are cleared, 
the larger fraction is unpacked, and the smaller 
is partially unpacked and aligned. There are 
four of these ALNs, each containing eight bits of 
the data path, sliced such that every fourth bit 
is found in the same ALN. The alignment opera- 
tion is done in two phases: a byte shift by the 
ALNs, followed by a bit shift in the FADs. The 
data is then bit-shifted to complete the unpack 
and align operations and added or subtracted by 
the fraction adder logic in the eight FAD MCAs, 
with four adjacent bits to a slice. 
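
The two-phase alignment described above can be
sketched as follows. This is a simplified model,
not the F Box logic itself: fractions are treated
as plain unsigned integers, bits shifted out are
simply dropped, and the name align_smaller is an
assumption made for the illustration.

```python
def align_smaller(exp_a, frac_a, exp_b, frac_b):
    """Align the fraction of the operand with the smaller exponent.

    Returns (larger exponent, fraction of the larger operand,
    aligned fraction of the smaller operand).
    """
    if exp_a < exp_b:
        # Make operand "a" the one with the larger exponent.
        exp_a, frac_a, exp_b, frac_b = exp_b, frac_b, exp_a, frac_a
    count = exp_a - exp_b                    # alignment count
    byte_shift, bit_shift = divmod(count, 8)
    aligned = frac_b >> (8 * byte_shift)     # phase 1: byte shift (ALN)
    aligned >>= bit_shift                    # phase 2: bit shift (FAD)
    return exp_a, frac_a, aligned


print(align_smaller(3, 0xC0, 5, 0x80))
```

Splitting the count with divmod mirrors the
division of labor in the text: the coarse part of
the shift is a multiple of eight bits, and the
remaining zero to seven bits are handled by the
bit shifter.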

If an addition is being performed, the F Box
sends a data-ready signal to the E Box to request
access to the W bus for the next cycle. This
signal is also sent if a subtraction is being
performed in which the exponent difference and
the high-order bits indicate that the result will
be ready within one CPU cycle. On the other
hand, if the subtraction is performed on two
numbers that are nearly equal, a large number
of leading zeroes will result. Those zeroes must
be normalized and the exponent must be
adjusted before the data-ready signal can be
sent. That process takes an extra CPU cycle.

Figure 3 F Box MCA Components

The fraction adders have a bit shifter for both 
alignment and normalization. In most cases, the 
number of leading zeroes is less than eight, so 
the bit shift and the rounding-constant add can 
be done in one pass. Simultaneously, the expo- 
nent processing logic receives the number of 
leading zeroes and adjusts the exponent for the 
final result. Then the hidden bit is masked and 
the result goes back to the four SOPs. There the 
result is packed into the F format and driven 
onto the W bus. The SOPs are sliced such that
each contains every fourth bit of the result to
allow shifting to within the nearest nibble 
(4-bit piece). The adder module can execute a 
typical ADDF (an add in F format) in four F Box 
cycles, or two CPU cycles. 

The hardware is arranged so that the "aver- 
age" floating point instruction executes 
quickly. The microcode steps through the 
sequence mechanically while enabling branch- 
ing to be performed whenever exceptional con- 
ditions are encountered. This branching will 
happen only when something atypical has 
occurred; for example, when the number of 
leading zeroes is greater than eight after the 
add. In that case, the result is passed through 
the SOPs, and back through the ALNs to be byte- 
shifted. Then the FADs complete the bit shift 
and the rounding. This process requires an 
additional CPU cycle to complete. 
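
The normalization step can be illustrated with a
short sketch. It assumes a fraction held as an
unsigned integer of a given width and reports
whether the result stays on the fast path. The
text says "less than eight" for the fast case
and "greater than eight" for the slow case;
treating exactly eight leading zeroes as the slow
case is an assumption here, as is the handling of
a zero fraction.

```python
def normalize(fraction, exponent, width=32):
    """Shift out leading zeroes and adjust the exponent to match.

    Returns (fraction, exponent, fast_path). fast_path is True when
    fewer than eight leading zeroes had to be removed.
    """
    if fraction == 0:
        return 0, 0, True    # assumption: a zero result needs no shifting
    leading_zeroes = width - fraction.bit_length()
    fraction <<= leading_zeroes      # move the leading one to the top bit
    exponent -= leading_zeroes       # compensate in the exponent
    return fraction, exponent, leading_zeroes < 8


print(normalize(0x40000000, 10))
print(normalize(0x00F00000, 10))
```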

The major difference between add operations 
in the F format and those in the D or G formats
is the handling of 32 additional bits of data in
the latter two. Rather than making all the
datapaths 64 bits wide, we opted to double-cycle
the F Box relative to the rest of the CPU.
Thus the first group of 32 bits of a number in D
or G format is handled during one cycle and the 
second group is handled during a second cycle. 
As the first step of the path, the exponent 
processor calculates the exponent difference, 
an 8-bit operation in F format and an 11-bit one
in G format. Then the high-order fraction bits 
are unpacked, aligned and stored in a register in 
the ALNs and in another register in the FADs. As 
the low-order fraction bits arrive during the 
next CPU cycle, they are unpacked and aligned 
through that same path and merged with the 
appropriate bits in the FAD registers. 

The low-order fraction bits are then added 
together and that result is passed to the SOPs to 
be held in an internal register. In turn, the high- 
order fraction bits are added, and the low-result 
bits are passed back through the ALN to the FAD 
inputs, the assumption being that the number
of leading zeroes is less than eight (no byte- 
shift is required). Once the high add is com- 
pleted, a leading-zero detector determines if 
that assumption is correct, which in most cases 
it is. 

Immediately after the high add, the low nor- 
malize-and-round is done. If it turns out that 
the number of leading zeroes is greater than 
eight, this result will be discarded. The 
microcode will guide the old sum through the 
byte shifter to the input for the fraction adder, 
yielding a normalization of up to 32 bits. If the 
microcode has not branched, the high
normalize-and-round is done. At the end of this cycle,
the hidden bit is masked and the result is 
passed to the SOPs, which then pack the high- 
result bits and drive them onto the W bus. One 
CPU cycle later, the low-result bits are driven 
onto the W bus. 

The total time spent in the F Box to perform 
operations on D and G formats is ten cycles, or 
a total of five CPU cycles. 

Multiplication Operations 
To perform multiplications, the operands are 
sent from the SOP MCAs to the F Box multiplier 
module. There the floating point formats are 
unpacked as follows: the leading bit is placed 
in the most significant position; the fraction 
bits follow the leading bit, in order of
significance; and finally the cleared exponent and the
sign bits. Figure 4 illustrates the conversion of a
number in F format. 

The conversions of the D and G formats are 
similar, although they have 64 bits instead of 
32. Figure 5 illustrates the packed D and 
G formats, where the G format has three extra
exponent bits. The H format has 1 sign bit, 15
exponent bits, and 112 fraction bits.

Four multiplier select (MPR) MCAs are used
to store the source operands. The MPRs feed the
multiplicand in 32-bit pieces and the
multiplier in 8-bit pieces to the multiply logic.

On the multiplier module there are 10 MCAs
(9-MPY, 1-MPZ) that perform the actual multi- 
ply operation, each one generating a 4-bit slice 
of the product. Each MCA has column reduction
logic that consists of a 4-bit, 5-stage adder
that adds the partial products, carries previous 
partial products, sums, and carries from the 
previous stage to create a new partial sum. All 
five stages are performed during every cycle. 
Each slice receives a byte of the multiplier and 
12 bits of the multiplicand. A trailing zero and
two leading zeroes are concatenated to the mul- 
tiplier. Then it is divided into five groups of 
three bits each, called "triplets," to determine 
the Booth encoding. Finally, each triplet is mul- 
tiplied by the multiplicand according to the 
Booth algorithm. Figure 6 shows the eight bit 
combinations and the corresponding products. 
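
The triplet encoding can be sketched in a few
lines. The sketch assumes the usual radix-4
(modified) Booth recoding, in which adjacent
triplets overlap by one bit so that the 11 padded
bits yield five groups; the digit table is the
standard one and is assumed to match Figure 6.
The function names are illustrative.

```python
# Standard radix-4 Booth digits for each 3-bit group.
BOOTH_DIGIT = {
    0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
    0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0,
}


def booth_triplets(multiplier_byte):
    """Split one multiplier byte into five overlapping 3-bit groups."""
    # Append the trailing zero; the two leading zeroes are implicit.
    padded = (multiplier_byte & 0xFF) << 1
    return [(padded >> (2 * i)) & 0b111 for i in range(5)]


def multiply_by_byte(multiplicand, multiplier_byte):
    """Sum the five Booth partial products for one multiplier byte."""
    total = 0
    for i, triplet in enumerate(booth_triplets(multiplier_byte)):
        # Each successive triplet carries four times the weight.
        total += (BOOTH_DIGIT[triplet] * multiplicand) << (2 * i)
    return total


print(multiply_by_byte(1000, 0xFF))
```

Each digit is one of 0, +1, -1, +2, or -2, so a
partial product needs only a shift and an optional
negation of the multiplicand rather than a full
multiplication.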

As each byte is multiplied, a 40-bit partial
product is held in an accumulator latch. As the 
processing sequences, the product of the next 
8-bit multiplication is added to the last partial 
product in the accumulator latch, thus produc- 
ing a new partial product. This cycling contin- 
ues until the multiplicand has been multiplied 
by all the multiplier bytes. The normal execu- 
tion time is reduced by one cycle because the 
last byte of multiplier has the cleared exponent 
bits in it. 

In F format, the first 8-bit X 32-bit partial 
product is formed, then shifted 8 bits to the 
right and loaded into the accumulator. The next 
8 multiplier bits are multiplied by the multipli- 
cand, then added to the accumulator and 
shifted right by 8 bits, and finally stored. A third 
such product is formed, added to the partial 
product, and the result is stored in the 
extended accumulator latches, ready to go to 
the adder module. 
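
The byte-at-a-time accumulation can be modeled as
below. This is a behavioral sketch only: it uses
ordinary integer multiplication for each 8-bit x
32-bit step, and the variable low stands in for
the bytes that leave the accumulator on each
right shift. The names are assumptions made for
the illustration.

```python
def byte_serial_multiply(multiplicand, multiplier, n_bytes):
    """Multiply a 32-bit multiplicand by n_bytes multiplier bytes,
    least significant byte first, with an 8-bit right shift between
    steps."""
    acc = 0   # models the 40-bit accumulator
    low = 0   # bytes shifted out of the accumulator
    for i in range(n_bytes):
        byte = (multiplier >> (8 * i)) & 0xFF
        acc += multiplicand * byte
        assert acc < (1 << 40)      # an 8 x 32 step never exceeds 40 bits
        if i < n_bytes - 1:
            low |= (acc & 0xFF) << (8 * i)   # keep the byte shifted out
            acc >>= 8
    return (acc << (8 * (n_bytes - 1))) | low


# Three multiplier bytes, as in the F format case described above.
print(hex(byte_serial_multiply(0x80000000, 0x800000, 3)))
```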

The D and G formats are processed in a simi- 
lar manner except that sixteen 8-bit X 32-bit 
multiplies are required to accomplish that task. 
After all of the multiplier bytes have been mul- 
tiplied by the least-significant 32 bits of the 
multiplicand, they then have to be multiplied 
by the most-significant 32 bits. Prior to the start 
of that multiplication, the partial product is 
shifted left by 24 bits to align it for subsequent 
addition to the next partial product. 

The Wallace Tree in Figure 7 illustrates the D 
and G format processing.

Since only 40 bits can be stored in the MPY
slices, the overflow is sent to four extended 
accumulator chips in the result accumulator 
(MAX). During every cycle, the MAX receives 
the least significant byte from the accumulator 
in the MPY slices if a right-8-bit shift is being 
performed. Or, if a left-24-bit shift is being
performed, the MAX receives the most-significant
24 bits of the accumulator and gives MPY the 
24 least significant bits from previous accumu- 
lations. After a left shift, the MAX sends the 
most significant byte to the accumulator for the 
succeeding right-8-bit shifts. After all the multi- 
plications have been completed, the 64-bit 
result is stored in the MAX, ready to go to the 
adder module to be normalized, rounded, and 
packed. 

Division Operations 

The fraction adder (FAD) performs a non-restor- 
ing division algorithm, one bit per F Box cycle. 
A control input in the FAD causes the adder 
module to do an add or a subtract, depending 
on the carry out of the previous fraction adder 
operation. The bit shifter will keep shifting the 
dividend to the left by one bit every cycle. For 
the F format, this shift produces a quotient bit 
every F Box cycle, while the double precision 
formats, D and G, get a quotient bit every other 
F Box cycle. 
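
A non-restoring division produces one quotient
bit per step and chooses between an add and a
subtract from the sign of the previous result.
The sketch below works on unsigned integers
rather than on floating point fractions, and the
function name is an assumption made for the
illustration.

```python
def nonrestoring_divide(dividend, divisor, n_bits):
    """Divide unsigned integers, producing one quotient bit per step.

    dividend must fit in n_bits. Returns (quotient, remainder).
    """
    if divisor <= 0:
        raise ValueError("divisor must be positive")
    remainder = 0
    quotient = 0
    for i in range(n_bits - 1, -1, -1):
        next_bit = (dividend >> i) & 1
        shifted = 2 * remainder + next_bit    # shift left by one bit
        if remainder >= 0:
            remainder = shifted - divisor     # previous result was positive
        else:
            remainder = shifted + divisor     # previous result was negative
        quotient = (quotient << 1) | int(remainder >= 0)
    if remainder < 0:
        remainder += divisor                  # single final correction
    return quotient, remainder


print(nonrestoring_divide(7, 2, 3))
```

Unlike a restoring algorithm, a negative partial
remainder is not undone immediately; the next
step adds the divisor instead, so every step costs
exactly one adder operation.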

To save hardware on the adder module, the 
quotient bits are sent to the multiplier module, 
where a counter (a split between the MPR 
MCAs) and several shifters (in the MAX MCAs) 
are used to manipulate the quotient bits into 
the correct form. That is, the most significant 
bit is placed in bit position 31, the next most
significant bit in position 30 ... down to the 
least significant bit. Then, the bits are sent back 
to the adder module for normalization, round- 
ing, and packing. 

Exponent and Sign Processing 
For all operations, the exponent processors 
(FXP and GXP) calculate the result exponents 
based on the input operands and normalization 
counts. Each processor has an 11-bit datapath
for exponent operations and a 2-bit counter for 
accumulating carries and borrows out of the 
leading bit. Counters are used at the end of the 
instruction to detect overflow and underflow 
problems. A non-zero counter number indicates 
that a problem has occurred. In that case, the 
F Box sends a signal to the E Box when the
calculation result is transmitted over the W bus.
In turn, the storing of that result is prevented, 
and a section of the E Box microcode is 
"trapped" to a routine that reads several F Box 
status registers in the FBR MCA. The microcode
then identifies the problem and initiates the 
exception processing. 
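
One way to picture the carry and borrow counter
is sketched below. It is an interpretation of the
description above, not the actual logic: the
class and method names are hypothetical, and the
counter is modeled as an unbounded signed integer
rather than a 2-bit field.

```python
EXP_BITS = 11
EXP_MASK = (1 << EXP_BITS) - 1


class ExponentDatapath:
    """11-bit exponent register with a carry/borrow counter."""

    def __init__(self, value=0):
        self.value = value & EXP_MASK
        self.counter = 0          # net carries minus borrows

    def add(self, delta):
        total = self.value + delta
        # Floor shift: +1 for a carry out, -1 for a borrow, else 0.
        self.counter += total >> EXP_BITS
        self.value = total & EXP_MASK

    def out_of_range(self):
        """A non-zero counter at the end signals overflow or underflow."""
        return self.counter != 0


exp = ExponentDatapath(1000)
exp.add(1500)     # intermediate value wraps past 11 bits
exp.add(-600)     # a later adjustment brings it back in range
print(exp.value, exp.out_of_range())
```

Counting carries and borrows, instead of latching
the first one, lets an intermediate value leave
the 11-bit range and return without being flagged;
only the state at the end of the instruction
matters.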

The sign processor in the GXP is a 1-bit
datapath, modeled in a fashion similar to the 
exponent datapath; in fact they share the same 
control and microcode signals. Instead of an 
adder, however, this processor uses a multi- 
plexer and an exclusive OR (XOR) gate to per- 
form sign operations. 

Polynomial Evaluations 
Polynomials are evaluated using Horner's 
Method, through a series of multiplications and 
additions. In the VAX 8600 system, the I Box 
prefetches coefficients from the M Box, and the 
E Box keeps track of intermediate results, dec- 
rements the degree, and deals with exceptions 
and address translations. The F Box performs 
the arithmetic steps described in the Addition 
and Multiplication sections above. All of these 
operations are done in parallel. 
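
Horner's Method reduces a polynomial of degree n
to n multiply-and-add steps. A minimal software
sketch follows; it assumes the coefficient list is
ordered from the highest-degree term down to the
constant term, and the function name is
illustrative.

```python
def horner(x, coefficients):
    """Evaluate a polynomial at x.

    coefficients are ordered from the highest-degree term to the
    constant term.
    """
    result = 0.0
    for coefficient in coefficients:
        result = result * x + coefficient   # one multiply, one add
    return result


# x**2 - 3*x + 2 evaluated at x = 2
print(horner(2.0, [1.0, -3.0, 2.0]))
```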

Microcode Control in the F Box 

Like every other subsystem in the 8600, the 
F Box is controlled by microcode. Microcode 
offers a structured yet flexible and economic 
way of implementing the control functions. For 
complex instructions, such as polynomial
evaluations, microcode is essential for sequencing
through the various steps. Even for the basic 
operations like add and multiply, microcode is 
helpful in dealing with unusual conditions. The 
achievement of a compact hardware design 
depended on the use of hardware units like 
adders and shifters for multiple purposes, and 
microcode provides sufficient control to 
achieve that design. Moreover, microcode con- 
trol allowed us the flexibility to implement 
fault detection and fault isolation procedures 
so that manufacturing and field service could 
effect repairs using microcoded diagnostic 
programs. 

We had to make several design restrictions in 
order to cycle the control store during each 
F Box cycle. For example, each module needed 
its own microsequencer and control store. And 
except for initial dispatching, the two 
microcodes run independently.

We latched the microword internal to the
MCAs that used them in order to save propaga- 
tion time and to eliminate the need for addi- 
tional MSI components. The microfields were 
highly encoded due to the limited number of 
MCA pins available. That high level of encoding 
allowed us to make the whole control store
relatively narrow: 48 bits for the adder module
and 40 bits for the multiplier module. That 
makes it easier for the F Box to access the con- 
trol store during each cycle. Inside the gate 
array, the F Box can decode the microcode into 
a large number of control functions, some of 
which are applicable over several cycles. The 
control signals are pipelined along with the 
data and the F Box gradually decodes those sig- 
nals at each stage (see Figure 8). The results of 
these data operations are sometimes fed back 
into later decode stages. This microcode style 
was needed, in particular, to accommodate the 
pipelined structure of the datapath, where sev- 
eral operations take place simultaneously. 

The result of those design restrictions was a 
scheme in which the microcontrol bits follow 
the data for several cycles, being further 
decoded at each stage. For the majority of cases, 
the microcode is little more than a decoding of 
the opcode, allowing the hardware to do align- 
ments, additions, normalizations, and round- 
ings. The microsequencer takes over only if the 
instruction does not fit the standard path and 
creates the needed result by using the available 
hardware functions. 



We had to define the operations at each cycle 
early in the design stage in order to get this 
tight fit between the microcode and the hard- 
ware. That was possible due to the relatively 
small number of operations involved in floating 
point processing. 

The short cycle time of the F Box complicates 
the control of microcoded branching. Each con- 
trol store location contains a NEXT ADDRESS 
field. To change control flow, the microcode 
selects up to three branch conditions at a time. 
The OR of these conditions and the low three 
address bits select the next microword to be 
executed. The selectors are controlled by a 
branch enable (BEN) field in the microword. 
The BEN field of a microinstruction does not 
affect the next micro-PC but does affect the one 
following it. (This is called "delayed branch- 
ing.") The delayed-branching algorithm com- 
plicates the microprogramming, since 
branches-in-progress always have to be 
accounted for. Figure 9 shows the different 
inputs and how they affect the next 
microaddress. 
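
The effect of delayed branching can be seen in a
small simulation. Everything specific in this
sketch is hypothetical: the control store
contents, the condition name, and the way the BEN
field is represented. Only the two rules stated
above are modeled, namely that branch conditions
are ORed into the low three address bits and that
a BEN field takes effect one microword late.

```python
def next_microaddress(next_address, conditions):
    """OR up to three branch-condition bits into the low address bits."""
    low_bits = 0
    for position, condition in enumerate(conditions[:3]):
        low_bits |= int(bool(condition)) << position
    return next_address | low_bits


def run(control_store, condition_values, start, steps):
    """Trace micro-PC values under delayed branching.

    control_store maps address -> (next_address, ben), where ben names
    the conditions to sample or is None. The BEN of one microword is
    applied when the following microword selects its successor.
    """
    trace = [start]
    pc = start
    pending_ben = None
    for _ in range(steps):
        next_address, ben = control_store[pc]
        if pending_ben is not None:
            conditions = condition_values[pending_ben]
        else:
            conditions = (0, 0, 0)
        pc = next_microaddress(next_address, conditions)
        pending_ben = ben
        trace.append(pc)
    return trace


# Hypothetical program: word 0 enables a branch that takes effect
# only when word 8 selects its successor (16 becomes 17).
store = {0: (8, "zero_check"), 8: (16, None),
         16: (24, None), 17: (32, None)}
print(run(store, {"zero_check": (1, 0, 0)}, start=0, steps=3))
```

In the trace, the word at address 0 requests the
branch, but the word at address 8 still executes
unconditionally; the redirection from 16 to 17
appears one step later.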

The microsequencer contains no stall signals. 
Instead, the microcode branches on conditions 
that will force it to change flow. Again, that 
microcode design simplifies the hardware 
design, since stall conditions can be encoded 
into normal control signals. 

The I Box sends the opcode of the instruction
to the multiplier module. There the opcode is
used to address the dispatch RAM containing
the starting microcode address for the
instruction. The same starting address is used for both
microsequencers. The dispatch RAM also contains
format bits that are used to control certain
hardware operations. An instruction register
decode (IRD) signal from the E Box triggers the
start of a new instruction. A "flush" signal from
the I Box is used to reset the microsequencer in
case of a change in the instruction stream,
normally due to a branch or an exception. Without
this signal, due to the pipelining of
instructions, the F Box might have started on a floating
point instruction following the branch. Such an
action would have put the F Box out of
sequence with the I Box and E Box.

Figure 9 The F Box Microsequencer

The E Box has the ability to trap the F Box to 
various microcode routines. That ability is use- 
ful when the program wants to use the F Box to 
execute subroutine functions in complex 
instructions, or when the program wants to 
write customer-originated microcode in the 
E Box. 

Error Checking and Reporting Using 
Microcode 

High reliability was a major goal for the whole 
VAX 8600 system. We used very reliable parts, 
conservative design rules, and a small number 
of components to design an inherently reliable 
machine. Furthermore, we implemented exten- 
sive checking for errors throughout the CPU. 
Our primary recovery strategy was to retry the
macroinstruction. If an error is detected, the
CPU will make every effort to preserve its state 
so that the macroinstruction can be restarted 
after the error has been logged. 

The F Box has sufficient idle time to run diag- 
nostic tests on itself while non-floating point 
instructions are executing in the E Box. This 
idle time exists because the F Box is involved in 
the execution of only a subset of the total 
instruction set. In these tests, the opcode is 
used to address the FDRAM, and a dispatch 
address for microcode is generated for a test of 
appropriate length. Operands are gathered from 
the OP bus to create a variety of test patterns. 
The microcode test runs through the basic 
floating point operations and checks the result. 
If an error occurs, it will be logged by the F Box 
and reported to the E Box the next time that 
a floating point instruction is encountered. In 
this way, although the CPU is not disrupted, the 
F Box cannot be used until the error has been 
evaluated by the VMS operating system. 

The error analysis software processes the 
error report. Since the CPU does not require the 
F Box in order to run, it can be temporarily 
disabled by the operating system if the error 
frequency is sufficiently high. In that way, com- 
puting can continue until the F Box can be 
repaired. 

Like the other subsystems in the 8600, the 
F Box is connected to the maintenance
processor, the console, over the serial diagnostic bus
(SDB). The console and SDB are used to initialize
the control store and other RAMs. The SDB is
also used to alert the CPU to signals required to 
diagnose failures encountered in manufactur- 
ing test or at customers' sites. Parity errors in 
the control store are corrected on-line by the 
operator using the console.


James B. McElroy

Packaging the 
VAX 8600 Processor 

Important packaging decisions were made early on the VAX 8600 pro- 
ject. First, the numbers of gates and parts were estimated to size the CPU. 
Then, a packaging evaluation method was developed to weigh the effects 
of various design factors. Packaging the components to control tempera- 
ture gradients was an important task. Several techniques for mounting 
devices were tried and the pin grid array was chosen. The module design 
is an equilibrium between component density and the number of signal 
layers. The tools developed for packaging decisions and the cooperation 
engendered among engineering disciplines will help future design 
projects. 



The role of packaging in the product develop- 
ment process has changed significantly in 
recent years. Today, the electronics packaging 
engineer must get involved earlier than ever 
before. He must make a vital contribution 
toward creating the actual design process, in 
addition to performing the traditional role of 
hardware design and evaluation. 

Accomplishing this expanded function 
requires the creation of effective and flexible 
tools for testing and evaluation, in addition to 
rigorous adherence to the best traditions of 
good engineering practice in the management 
of a large and complex project. The importance 
of such tools was compellingly demonstrated 
during the development of the VAX 8600 
processor. The tools developed and the lessons 
learned from designing the packaging for this 
machine can assist future computer design 
efforts by making product development more 
predictable. As a result, new systems can be 
developed in less time, with less cost and risk. 

At the beginning of a development project, 
little reliable information is available about the 
physical characteristics of the product. Gener- 
ally, packaging engineers are forced to rely on 
extrapolations from previous products and 
early estimates by system designers. But this ini- 
tial information is the basis for packaging and 
interconnect decisions that must carry through
the development cycle and often through the
life of the product as well. On the other hand, 
from time to time, it may be prudent to make 
"midcourse" corrections based on current 
developments and maturing technologies. 
Hence it is necessary to implement a design 
process that constantly inspects the "state of
the design" and provides early warning of 
potential problems. 

Ascertaining the Task 

Among the initial questions to be answered for 
any design project are, (a) what is the size of 
the task? (b) what will the product be made of? 
(c) what requirements must it satisfy? In addi- 
tion to the many safety and EMI regulations that 
the 8600 had to meet, we decided it was neces- 
sary to package the system in accordance with 
the new European standards for noise emissions 
in data processing equipment. These standards 
are considerably more stringent than those by 
which any previous Digital computers were 
built. 

Another early decision was to implement the 
CPU with LSI macrocell arrays (MCAs) sup- 
ported by small-scale and medium-scale inte- 
gration (SSI/MSI) emitter-coupled logic (ECL) 
and RAMs. An internal Digital maintainability 
study indicated that costs for spares could be 
reduced substantially by providing for on-site
replacement of MCA and RAM chips. Therefore,
it was agreed that those components would be 
mounted in sockets. 

To determine the size and organization of the 
CPU, we worked initially with the logic design- 
ers to estimate the counts of gates, parts and 
modules and to determine the makeup of mem- 
ory and the I/O ports. Table 1 compares the 
numbers of gates and parts in the VAX-11/785
CPU with the early estimates for the VAX 8600
CPU. The last column gives the same data for
the final product; some estimates were fairly
close, others were not. Much of the increase in 
gate count comes from the increased use of 
pipelining to improve performance and from 
additional diagnostic features. This trend will 
continue in future design projects. 



Table 1  Gate and Part Counts

            VAX-11/785   VAX 8600         VAX 8600
                         Early Estimate   Final Design
Gates       68K          88.5K            104K
RAM bits    1.06M        1.05M            1.04M
SSI/MSI     2600         260              1100
MCAs                     141              145
Modules     26           10               17



To estimate the number of MCAs, besides the
gate count estimate, we would have to have
known the design efficiency factor, since it is
improbable that each array will use 100% of the
available cells due to routing inefficiencies and
power/thermal limits. Initial component
estimates are rough at best, so a conservative safety
factor was included to prevent difficulty when
the actual counts became known.

Evaluating the Choices 

Once it was determined what was being built, 
we faced a multitude of individual implementa- 
tion decisions related to choices of sockets, 
heat sinks, connectors, cables, and so forth. To 
facilitate the decision process, we developed a 
procedure for comparing the effects of the vari- 
ous alternatives in each instance and thus to 
help us select from among them. The first step 
in utilizing this procedure is to determine
which system factors are significantly affected
by the decision and the relative importance or 
"weight" of each (such that the weights sum to 
unity). Then for each factor some method is 
devised for quantifying the effect of each alter- 
native to arrive at a rating on a scale from 1 
(low) to 10 (high). Finally, in order to be able 
to compare the total "scores" of the alternative
solutions, the ratings were converted to "nor- 
malized" values by multiplying each by the 
corresponding weight. 
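
The procedure amounts to a weighted sum per
alternative. The sketch below applies it to the
weights and ratings listed in Table 2; the data
structure and function name are illustrative
choices, not part of the original method.

```python
# factor: (weight, rating for individual heat sinks,
#          rating for heat-pin planar), values from Table 2
FACTORS = {
    "Reliability":        (0.20, 6, 7),
    "System Performance": (0.15, 5, 5),
    "Risk":               (0.20, 7, 4),
    "Development Cost":   (0.10, 6, 3),
    "Spares Cost":        (0.10, 5, 4),
    "Design Process":     (0.05, 5, 3),
    "Acoustics":          (0.05, 5, 7),
    "Product Cost":       (0.10, 5, 6),
    "Signal Integrity":   (0.05, 5, 4),
}


def total_value(column):
    """Sum weight * rating over all factors for one alternative."""
    return sum(row[0] * row[column] for row in FACTORS.values())


# The weights are chosen to sum to unity.
assert abs(sum(row[0] for row in FACTORS.values()) - 1.0) < 1e-9

print(f"Individual heat sinks: {total_value(1):.2f}")
print(f"Heat-pin planar:       {total_value(2):.2f}")
```

Because the weights sum to unity, each total stays
on the same 1-to-10 scale as the individual
ratings, which makes the alternatives directly
comparable.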

The alternatives and their impact on the vari- 
ous factors can be listed in a matrix; an exam- 
ple of this is shown in Table 2. Here the choice 
is between two overall packaging/interconnect 
structures, one using individual heat sinks to 
cool the MCAs, the other employing a heat-pin 
planar approach (both are discussed later).
Different parameters play a role in different
decisions. Often these parameters are difficult to
quantify early in the project. It is important, 
however, to understand the relative differences 
between the competing concepts so that a rat- 
ing can be attached to each factor. 

Thermal Design 

Thermal design in the VAX 8600 processor was 
especially critical because individual MCAs can 
dissipate up to five watts. Both project risk and 
marketing considerations required using air 
convection for heat removal. We investigated 
two approaches to the problem. One employed 
an individual heat sink, or exchanger, on each 
MCA, wherein heat was conducted through the 
device carrier to an omnidirectional heat sink 
mounted by a thin layer of epoxy. The other
was a large, finned heat sink covering the entire 
back of the module. Conductive pins protrud- 
ing through the board conducted heat from the 
MCAs to the exchanger. In either case all other 
components were to be cooled in the tradi- 
tional way, by forced air convection. Figure 1 
depicts the "heat-pin" arrangement. Using 
heat-dissipating dummy devices, we conducted 
temperature and airflow experiments to deter- 
mine the thermal densities and device place- 
ments that would be used for the product. To 
predict temperatures, we used a thermal analy- 
sis tool developed by Digital's Thermal Engi- 
neering Group to model the actual modules as 
they would be in real operation. 

Table 2 Packaging and Interconnect Evaluation

                                      Individual Heat     Heat-Pin Planar
                                      Sink Packaging      Packaging
                          Weight of
System Factor             Factor      Rating    Value     Rating    Value

Reliability                 .20         6       1.20        7       1.40
System Performance          .15         5        .75        5        .75
Risk                        .20         7       1.40        4        .80
Development Cost            .10         6        .60        3        .30
  (Engineering and
  Manufacturing)
Spares Cost                 .10         5        .50        4        .40
Design Process              .05         5        .25        3        .15
Acoustics                   .05         5        .25        7        .35
Product Cost                .10         5        .50        6        .60
Signal Integrity            .05         5        .25        4        .20

Total Value                                     5.70                4.95

Figure 1 Heat Pin Detail

Figure 2 Heat Pin Planar Packaging

There were two possibilities for using the
ganged heat exchanger. One involved a single
exchanger on each module, with the module 
plugged into a backplane in the usual fashion. A 
novel planar approach was also considered in 
which all the CPU modules would be mounted 
on two sides of a large air heat exchanger. As 
shown in Figure 2, each plane contains several 
modules interconnected by flex circuitry, 
which also connects from one side of the plane 
to the other. This approach provided access to 
all of the components without disturbing inter- 
connect or cooling. 

Based on the weighting of the various parameters
in Table 2 plus other program considerations,
we proceeded with the individual heat
sinks and the standard module-to-backplane
configuration.

Regardless of the configuration selected, the 
cooling system had to deliver sufficient cooling 
air while conforming to the European noise 
reduction standards. To meet these needs, we 
devised a single-motor, four-wheel blower sys- 
tem to circulate the necessary air volume at 
appropriate pressure. An acoustic damping 
treatment applied to the enclosure doors 
reduced the noise emissions to an acceptable 
level. This packaging design not only met the
acoustic noise regulations, but also yielded a
much quieter machine than any previous 
Digital computer of this size. 

Device Packaging 

To meet the objective of on-site replacement of 
LSI and ECL RAM devices, we decided to provide
sockets for them. Unfortunately, the reliability
of sockets for MCAs was not well established,
so it was necessary to provide an
alternative scheme to hard-mount them. A 68
I/O leadless chip carrier (LCC) met all of the
requirements.1 Even soldered-on clips could
be used if necessary in place of the sockets.

Since SSI/MSI and RAM devices were widely 
available only in DIP format, we decided to use 
that package type. Thus DIP sockets, several of 
which were already qualified in Digital, were 
used for RAM replacement. 

This mixture of component types forced us to 
choose a through-hole solder assembly tech- 
nique because Digital has no mixed soldering 
process (for surface-soldered and through-hole 
components on the same board). Therefore, 
both the socket (Figure 3, on the left) and the 
solder clips (Figure 4) for the MCA were 
designed in the through-hole configuration. To 
reduce the inductance, the socket has a parallel 
path for the device ground through the cover. 

During the course of the project, two 
problems arose relative to mounting the MCAs. 
The first was that the solder clip had to be
installed by hand. At about the same time,
Motorola indicated that they would develop a 
pin grid array (PGA) package for the MCA. By 
working closely with the vendor, we obtained a 
package (Figure 3, on the right) that matched 
the electrical performance and footprint of the 
LCC socket, allowing the substitution of the 
PGA for the solder clip as our backup. The next 
issue that arose was that sockets for the MCAs 
would not be available at sufficient quality 
levels within an acceptable time frame. At that 
point we switched to the PGA as the primary 
packaging technique. 

It was originally intended that the MCAs 
would themselves incorporate diagnostic hard- 
ware, but this feature was discovered to impair 
the yield. The solution to that problem, providing
supplementary hardware for diagnostics,
created another: getting maximum hardware
into minimum space. The module partitioning 
was already solidly established by the time we 
learned of the need for supplementary hard- 
ware. Fortunately, a SIP design, mounted with 
40-mil centerline chip carriers, enabled us to 
install the diagnostic hardware in the limited 
space available. 

Module Packaging 

Figure 3 LCC with Socket and PGA Package

Figure 4 LCC with Solder Clips

Figure 5 LSI Module

The initial module choice was one similar to
the printed wiring board used in the
VAX-11/750 system. It was the right size for our
partitioning and density needs. However, to
provide the maximum number of edge finger 
pins for signals, supplementary power and 
ground connectors were developed. In this way 
we could get signal pins sufficient for the logic 
that would be put on the board (some of the 
282 pins are used for ground, but none for 
power). Figure 5 depicts the module, which is 
of controlled impedance construction and has 
eight layers, four of which are for signals. To 
ensure interconnect capacity, several trial lay- 
outs were done on early designs. 

As the system design progressed, the number 
of gates needed to perform the required func- 
tions grew significantly, as demonstrated by 
Table 1. Eventually all spare slots were used 
and more were needed. But in some areas addi- 
tional module crossings were unacceptable for 
reasons of system performance. So we decided 
to violate the rules for component density on 
the modules and added the extra gates to the 
modules already in place. Significant margins 
existed in power and cooling, but the intercon- 
nect was not adequate. We therefore had to add 
two signal layers to some of the modules. That 
posed a problem because, with traditional edge 
connectors, the extra signal layers had to be 
provided without any change in edge thickness. 
Two solutions to this problem were proposed. 
One involved a graduated layup in which the 
module itself would carry two more layers (a 
total of ten) while maintaining the eight-layer 
thickness at the connector. The other was an 
eight-layer construction with six, instead of 
four, layers for signal paths. When prototypes 
of each alternative were tested, the uniform 
eight-layer arrangement proved to be the satis- 
factory design, as it was easier to produce and 
less expensive. 

Backplane 

For the backplane we used a printed wiring 
board with the same routing grid and con- 
trolled impedance as the module boards. But 
the backplane has sixteen layers of which eight 
are for signal traces. To prevent problems due 
to Z-axis expansion during soldering, we used 
only compliant press-pin connectors. This also 
meant no drilling would be needed to add or 
delete nets because the press pins have wire- 
wrap tails for wire adds. Figure 6 shows a back- 
plane assembly mounted inside the system 
enclosure. Also visible is the power distribution 
structure, which can provide up to 400
amperes of current at -5.2 volts to the processor.

Lessons for the Future

The experience of developing a physical design 
for the VAX 8600 processor demonstrated both 
the value of tools available to the package 
designer (e.g., the weighted comparison pro- 
cess and the thermal analysis software) and the 
need to improve those tools. 

In particular, the events of the device-level 
packaging phase indicate the need for a design 
database approach offering numerous built-in 
test points or decision thresholds. This process 
allows earlier identification of problems, ena- 
bling engineers to switch from one strategy to 
another without disrupting the project 
schedule. 

Similarly, the use of routing-prediction 
software derived from proven interconnect 
algorithms* reduces the incidence of routing 
inefficiencies. And it provides adequate safety 
margins in estimating gate and part counts at 
the beginning of a project. 



Two other product development objectives 
were revealed as a result of VAX 8600 design 
efforts. The first is the need to ensure that con- 
nector technology is not dependent on module 
thickness. Then extra layers can be added with- 
out greatly affecting related hardware. Second, 
larger safety margins must be provided to 
reduce module routing difficulties. 

Overall, the greatest need is for tools that 
provide accurate monitoring of design evolu- 
tion as a whole, and also in the individual 
regions of development. This is especially true 
given the great increases in complexity from 
one project to the next. Many people are 
involved in building a sophisticated computer 
system like the 8600, and everyone must know 
what the others are doing. 

As the industry continues to evolve and 
mature, it becomes essential that package 
designers communicate actively with system 
and logic designers, as well as manufacturing,
marketing, and customer support personnel.
The development of tools and systems that pro- 
vide expanded insight into the progress of a 
whole project will assist packaging engineers in 
becoming creators of design processes as well 
as developers of hardware.

Signal Integrity in
the VAX 8600 System

Maintaining signal integrity in ECL is necessary for fast execution 
speeds. In the VAX 8600 project, software tools were developed to eliminate
signal problems before hardware was constructed. The number of
signal layers was determined by modeling the components and routing 
channels. The worst-case noise margins were set on the basis of noise 
immunity. Power distribution can affect the margins, so special care was 
taken to limit transients. Temperature changes, which also cause signal 
level shifts, had to be limited. Waveforms and their reflections were 
modeled to identify the transient response. Another model identified 
crosstalk problems in parallel runs. 



To achieve the performance goals set for the 
VAX 8600 CPU, emitter-coupled logic (ECL) 
was chosen for implementing the design. This 
consists principally of custom macrocell arrays 
(MCAs), and standard series 10K logic and 
RAMs. The challenges and problems that utiliza- 
tion of this technology presented were investi- 
gated by studying an earlier ECL design at 
Digital. This investigation resulted in the allo- 
cation of signal noise margins and the recogni- 
tion of the need for new software tools for noise 
summation, and reflection and crosstalk analy- 
sis. As the design of the machine progressed and 
problems were encountered, we improved the 
new software to analyze whole networks and to 
allow as much flexibility as possible without 
risking the time to market. 

Printed Wiring Board 
Characteristics

The first tasks were to select the characteristics 
of the printed wiring board (module) to be 
used and to determine the number of compo- 
nents that could be interconnected on it. The 
characteristics chosen were the following: 

1. The board will be the same height and
width as that in the VAX-11/750 and
VAX-11/780 systems.

2. Board thickness will be limited by the
card edge connector chosen for the
system.

3. The routing grid will be 50 mils to guarantee
a maximum of 5 percent backward
crosstalk.

4. Interconnect impedance will be maintained
at 55 ±5 ohms.

Items 3 and 4 also apply to the printed wiring 
backplane that carries the signals between the 
modules. The minimum desirable impedance is 
50 ohms to match the minimum output drive 
capability of the MCAs (the MCA 25-ohm driv- 
ers are strictly for double-ended buses, where 
the lines in each direction are 50 ohms). The 
higher the impedance, the thicker the dielec- 
tric must be for a given signal conductor cross- 
sectional area. And the thicker the dielectric, 
the fewer layers that can be incorporated into a 
board of the maximum thickness (180 mils). 
Thus 55 ±5 ohms fits the requirements neatly, 
and within this constraint the backplane actu- 
ally reached the limit in number of layers. 

The number of components is obviously limited
by the available space, that is, the area of the
board. But it also depends on the number of
interconnections that can be made among those
components. In investigating this issue, special 
consideration was given to signal IR drops due 
to interconnect length, as the voltage drop 
along a conductor directly subtracts from the 
noise margin at the input to the receiving gate. 
To solve this problem, different line widths 
were used in the different signal layers of each 
board. Signals could then be assigned to partic- 
ular layers depending on the length of the sig- 
nal path. Thus, longer lines could be assigned 
to wider signal traces to equalize the IR drops. 
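
The effect of trace width on IR drop follows from the resistance of a rectangular conductor, R = rho * L / (w * t). The sketch below illustrates the idea only: the copper resistivity is an approximate room-temperature value, and the trace dimensions and the 22-milliampere load current are assumptions chosen for the example, not figures from the 8600 design. The 25-millivolt limit is the signal IR drop allocation given later in the noise budget (Table 1).

```python
# IR-drop sketch for a printed wiring trace. All dimensions and the load
# current are hypothetical; only the 25 mV allocation comes from the article.
COPPER_RESISTIVITY_OHM_M = 1.68e-8   # approximate, room temperature
METERS_PER_MIL = 25.4e-6
METERS_PER_INCH = 0.0254
SIGNAL_IR_BUDGET_MV = 25.0           # signal IR drop allocation (Table 1)


def trace_resistance_ohms(length_in, width_mil, thickness_mil):
    """Resistance of a rectangular copper trace: rho * L / (w * t)."""
    length_m = length_in * METERS_PER_INCH
    area_m2 = (width_mil * METERS_PER_MIL) * (thickness_mil * METERS_PER_MIL)
    return COPPER_RESISTIVITY_OHM_M * length_m / area_m2


def ir_drop_mv(length_in, width_mil, thickness_mil, current_ma):
    """Voltage drop along the trace in millivolts (ohms * mA = mV)."""
    return trace_resistance_ohms(length_in, width_mil, thickness_mil) * current_ma


if __name__ == "__main__":
    # A long line on a narrow trace versus the same line on a wider trace.
    for length_in, width_mil in [(10, 8), (20, 8), (20, 16)]:
        drop = ir_drop_mv(length_in, width_mil, 1.4, 22.0)
        status = "ok" if drop <= SIGNAL_IR_BUDGET_MV else "over budget"
        print(f"{length_in} in, {width_mil} mil wide: {drop:.1f} mV ({status})")
```

Doubling both the length and the width leaves the drop unchanged, which is the reasoning behind assigning longer lines to layers with wider traces.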

With this information and the early compo- 
nent estimates from the logic designers, we 
determined the number of components on each 
board and how many signal layers would be 
needed to interconnect them. Then, from speci- 
fications of the amount of power consumed by 
each component, the total power drawn by 
each board and by the entire CPU was estimated.
In turn, these estimates allowed us to
determine the thickness of the copper in each 
module and in the backplane. At this point mak- 
ing layups of the hardware could begin. The 
result was that different modules in the CPU 
vary from two to six signal layers, and the CPU 
backplane has eight signal layers. 

Although this early analysis was useful, in the 
actual layout of the modules we ran into board 
routing problems. To solve them, a program 
was written based on Schmidt's article on estimation
of circuit pack parameters using Rent's
Rule.1 As input the program requires the number
of components of each type on a board, the
number of signal pins on each component type,
the size of the board, and the number of routing
channels between adjacent component pins. 
From this information, the program determines 
the number of signal layers required to route 
the board. 
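
The program itself is not reproduced here, but the flavor of such an estimate can be sketched. Rent's Rule relates the number of external terminals T of a group of g blocks to the average pins per block t and an exponent p, as T = t * g^p. The layer estimate below is a deliberately crude comparison of wiring demand against routing capacity; it is not Schmidt's method, and the board size, wire demand, and utilization figures are hypothetical. Only the 50-mil routing grid is taken from the text.

```python
import math


def rent_terminals(blocks, pins_per_block, rent_exponent):
    """Rent's Rule: external terminals T = t * g**p."""
    return pins_per_block * blocks ** rent_exponent


def signal_layers_needed(wire_demand_in, board_w_in, board_h_in,
                         grid_mil, utilization):
    """Crude estimate: total wire length demanded / usable length per layer.

    One layer is assumed to offer one track per grid pitch across the
    board; 'utilization' is the assumed usable fraction of that capacity.
    """
    tracks_per_inch = 1000.0 / grid_mil
    capacity_in = board_w_in * board_h_in * tracks_per_inch
    return math.ceil(wire_demand_in / (capacity_in * utilization))


if __name__ == "__main__":
    # Hypothetical board: 100 blocks, 4 pins each, Rent exponent 0.5.
    print(rent_terminals(100, 4, 0.5))
    # Hypothetical 10 x 10 inch board on the 50-mil grid, half usable.
    print(signal_layers_needed(5000, 10, 10, 50, 0.5))
```

A real estimator would derive the wire demand from the component and pin counts rather than take it as an input; it is passed in directly here to keep the sketch short.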

We also created new programs to obtain bet- 
ter correlation between calculated printed wir- 
ing impedances and measured impedance val- 
ues (in other words, to obtain better 
prediction). These so-called "field" programs 
employ electromagnetic theory to simulate the 
inductance, capacitance, and resistance of conductors
of arbitrary size and shape in two and
three dimensions. From these simulated charac- 
teristics, the programs compute the electrical 
parameters for microstrip and stripline configu- 
rations, and the crosstalk between conductors. 
The three-dimensional program also computes
the crossover capacitance of signal conductors
that are on adjacent layers and routed orthogo- 
nally to each other. This last computation is 
important because the crossover capacitance 
increases the propagation delay of signal traces 
and lowers their impedance. Additional 
enhancements are being planned for these pro- 
grams to better analyze signal reference planes 
from an alternating current (ac) viewpoint. 

Noise Margins 

To design a reliable system, it is necessary to 
understand the direct current (dc) noise mar- 
gin for the ECL gates being used. Different logic 
families have different characteristics in the 
way tracking rates of input and output levels 
depend on variations in temperature and supply 
voltage. These variables were used to determine 
the worst-case dc noise margins, depicted in 
Figure 1. However, if a system were to be
designed around worst-case dc noise margins, 
then all the noise contributions summed 
together could not exceed those margins. This 
would be far more restrictive than necessary for 
system integrity and would be devastating for 
system performance. That is, the gates would 
have to be so far apart that the interconnect 
delay between them, on which system cycle 
time depends, would be unacceptable. 



Figure 1 DC Noise Margins (the worst-case dc noise margin is the smaller
of the high-level and low-level noise margins)



On the other hand, by understanding the dc
noise margin for a given gate, one can also
obtain its ac noise margin. In particular, for
each gate one can derive an input-signal ac
noise immunity curve (Figure 2), which shows
what amplitude of input noise is required to
switch the gate output at any noise pulse width.
Based on this relationship, if the sum of all
input noise contributors for each gate in the
system is less than the noise required to switch
the output, then the integrity of the system can
be guaranteed. This criterion is much less
restrictive than dc noise margins; in other
words, any point below the curve of Figure 2 is
acceptable. Therefore, ac noise immunity was
used to set the worst-case noise margins for the
ECL logic in the 8600.

Figure 2 AC Noise Immunity (noise amplitude versus pulse width)

Based on all known noise contributors, we 
determined the ac noise margin for the system. 
To set up the design rules, we then assigned an 
amplitude to each noise contributor; that is, the 
noise was allocated among the various sources, 
as shown in Table 1. This allocation allowed us
to define a routing grid on the printed wiring 
boards and backplane, and to select connectors 
and transmission line cables. 



Table 1 Noise Budget

                                               Allocation in
Noise Contributor                              Millivolts

Load reflections                                    100
Crosstalk                                           100
Interconnect mismatch impedance                     100
Simultaneous switching of outputs                   150
-2.0 V ac noise on signal line                       25
Signal IR drop                                       25
VCC IR drop                                          14
Gate feed-through                                    50
Output voltage adjustment to thermal variations       6


Finally, we wrote a program to sum all noise 
contributions (worst case but without taking 
signal timing into consideration) for each ECL 
network in the system. Those networks identi- 
fied as potential problems were analyzed by 
hand using timing information to determine the 
impact on the system. Real problems were 
resolved by reducing one or more of the noise 
contributions (such as crosstalk from adjacent 
signal traces) or by spacing loads farther apart 
on the transmission line to reduce the ampli- 
tude of load reflections. 
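
The summation step can be illustrated with a short sketch. The allocations are the Table 1 values; using their sum as the flagging limit is an assumption made for the example, as are the network names and the per-network figures. Like the program described above, the sketch adds worst-case contributions without regard to signal timing.

```python
# Noise summation sketch. Budget values are from Table 1; the limit, the
# network names, and the per-network numbers are hypothetical.
NOISE_BUDGET_MV = {
    "load_reflections": 100,
    "crosstalk": 100,
    "interconnect_mismatch": 100,
    "simultaneous_switching": 150,
    "terminator_supply_ac_noise": 25,
    "signal_ir_drop": 25,
    "vcc_ir_drop": 14,
    "gate_feed_through": 50,
    "thermal_output_shift": 6,
}


def total_noise_mv(contributions):
    """Worst-case sum of all contributions, ignoring signal timing."""
    return sum(contributions.values())


def flag_networks(networks, limit_mv):
    """Return names of networks whose summed noise exceeds the limit."""
    return sorted(name for name, contributions in networks.items()
                  if total_noise_mv(contributions) > limit_mv)


if __name__ == "__main__":
    limit = total_noise_mv(NOISE_BUDGET_MV)
    networks = {
        "NET_A": dict(NOISE_BUDGET_MV, crosstalk=160),        # long parallel run
        "NET_B": dict(NOISE_BUDGET_MV, load_reflections=60),  # lightly loaded
    }
    print(limit, flag_networks(networks, limit))
```

Networks returned by `flag_networks` correspond to those that would then be examined by hand with timing information.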

Controlling Noise Sources 

The largest allocation in the noise budget is the 
one that little can be done about: the simultane- 
ous switching of outputs, which generates 
150 millivolts of noise. Other sources could be 
just as noisy, but the allocations for them 
reflect the fact that action can be taken to 
reduce them. Besides the use of different width 
traces to equalize signal IR drops, the major 
efforts lay in power distribution, load reflec- 
tions and crosstalk. 

Power Distribution 

Power distribution is an especially important 
factor in designing an LSI system with ECL. Sup- 
ply regulation is implemented through remote 
sense points located near the logic circuits. But 
the number of such points is necessarily lim- 
ited, and an excessive supply drop between a 
sense point and any ECL gate would adversely 
affect the noise margin. Of course, there must 
also be sufficient decoupling of the supply 
voltages. 

To obtain a reasonable dc noise margin on
the ECL gates, a goal was set that all factors
contributing to variation in the supply voltage
at any point in the distribution could cause no
more than ±3 percent variation in the nominal
VEE voltage. Table 2 lists these factors and the
allowable variation in each.

ECL gates wired together are particularly sensitive
to VCC voltage differences because the
reference for both output and input thresholds
is itself referenced to VCC. Furthermore, any ac
noise on a VCC line not common to both gates
may reduce the noise immunity. To minimize
VCC differences and equalize ac effects, full
ground reference planes were used in both the
modules and the backplane. These planes keep
the inductance in the VCC path between chips
as low as possible.

To reduce the total power required by the
system, we employed a smaller supply voltage,
-2.0V, for the terminators. This allowed us to
use a terminating resistance that matched the
line impedance better. But it also created the
possibility of large changes in terminator current
over an entire module, a situation that
would produce large transient voltages. Any
noise in the terminator voltage is coupled in
part onto the signal wires. To reduce these transients,
decoupling capacitors for both high and
low frequencies were distributed throughout
the modules. The specification calls for enough
decoupling to limit transients to 50 millivolts
on VEE and VTT. Table 2 also shows the allowable
variations in the factors affecting the terminator
supply.

Table 2 Power Supply Variation

                                          Variation in
Factor                                VEE -5.2V    VTT -2.0V

Regulator tolerances                    1.0%         2.0%
  Line/load regulation,
  ripple, long-term change
  in dc regulator output
Noise transients due to                 1.0%         2.5%
  load current changes
Distribution IR drops                   1.0%         1.0%



Thermal Considerations

The signal output and input levels of circuits
shift with changes in temperature. To hold the
dc noise contribution from this factor within its
allocation required limiting to 10°C the air
temperature difference between any two
devices connected together through any network.
The thermal engineers attempted to guarantee
adherence to this criterion by holding the
temperature rise across every individual module
to 10°C. Since the heat generated by the
different modules varies considerably, this goal 
turned out to be unattainable. But a thermal 
analysis of every network, including those that 
extended over multiple modules via the back- 
plane, showed that the fundamental require- 
ment relative to any two devices in any network 
was met. 



Load Reflection Analysis

To analyze load reflections, we created a simulator
that models a transmission line in the time
domain. This program is specifically for ECL
circuitry, and it gives results similar to those of
SPICE2 but takes much less CPU time. To model
a waveform at any point on a line, the simulator
divides the total delay into many increments
and calculates a set of values for the waveform
corresponding to those increments. The calculations
take into consideration the (a) impedance
and propagation delay of the line,
(b) input and output impedances for each gate,
(c) package capacitances, and (d) electrical
parameters of signal connectors. Besides the set
of values representing the generated waveform 
propagating along the line, the program also 
calculates a second set representing the 
reflected waveform. In a manner analogous to 
the result of a waveform and its reflection on 
the line, the corresponding values in the sets 
are summed. With this technique for waveform 
analysis, we can determine the transient 
response for each output and input on arbitrary 
networks. Using the appropriate differential 
equations to represent source and load models 
gives results that are comparable to those given 
by SPICE.
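
The idea of summing a waveform with its reflections can be illustrated with a classic bounce-diagram calculation. This sketch is far simpler than the simulator described above: it assumes an ideal lossless line with purely resistive source and load, and it ignores package capacitance, connectors, and the nonlinear ECL gate models. The function name and the example values are hypothetical.

```python
def bounce_diagram(v_source, r_source, z0, r_load, t_delay, n_arrivals):
    """Load-end voltage of a step launched into an ideal line.

    Returns a list of (time, voltage) pairs, one per arrival of the wave
    at the load. Each arrival adds the incident wave plus its reflection.
    """
    gamma_load = (r_load - z0) / (r_load + z0)        # reflection at load
    gamma_source = (r_source - z0) / (r_source + z0)  # reflection at source
    incident = v_source * z0 / (z0 + r_source)        # wave first launched
    v_load = 0.0
    history = []
    for k in range(n_arrivals):
        reflected = incident * gamma_load
        v_load += incident + reflected
        # Arrivals occur after 1, 3, 5, ... one-way line delays.
        history.append(((2 * k + 1) * t_delay, v_load))
        incident = reflected * gamma_source           # re-reflected wave
    return history


if __name__ == "__main__":
    # Hypothetical case: 10-ohm driver, 50-ohm line, 100-ohm load, 1 ns delay.
    for time_ns, volts in bounce_diagram(1.0, 10.0, 50.0, 100.0, 1.0, 6):
        print(f"t = {time_ns:.0f} ns  V_load = {volts:.4f}")
```

With a mismatched load the first arrival overshoots and later arrivals ring toward the dc value set by the resistive divider; with a matched load the first arrival is already the final value.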

Once a good correlation was obtained 
between bench measurements and simulations, 
we added algorithms to calculate the minimum 
and maximum propagation delays along each 
ECL network in the system. When the gate 
delays, interconnect delays, and appropriate 
logic conditions were established, we could 
analyze the timing of the VAX 8600 CPU using 
worst-case parameters. These parameters 
included both minimum and maximum values
for gate delays, output rise and fall times, interconnect
delays, and impedances of the interconnect
for each logic path in the CPU. The
program that calculates interconnect delay can
also analyze networks containing multiple
sources (i.e., wire-ORs and buses).

Crosstalk and Interconnect

As boards become denser and switching speeds 
faster, crosstalk becomes an increasingly impor- 
tant source of noise. The program for calculat- 
ing crosstalk, which can be used for TTL and 
ECL, finds all parallel pieces of signal etch on a 
board. It then calculates the crosstalk contribution
to each victim segment from all parallel
aggressor signal runs, within reasonable limits
(it ignores those too far away). The calculations
are based on the length and separation of paral- 
lel runs using crosstalk coupling coefficients 
rather than transmission line simulation. The 
voltages for each run are added and reported as 
the total crosstalk voltage coupled into the vic- 
tim network. If this total exceeds a specified 
threshold, the report includes a breakdown of 
the crosstalk for each run. 
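
A coefficient-based summation of this kind might look like the sketch below. The coupling model (a fixed millivolts-per-inch figure that falls off with the square of the separation, growing linearly with run length) is a hypothetical stand-in for the real coupling coefficients, which are not given here; the separation cutoff is likewise assumed. The 100-millivolt reporting threshold is the crosstalk allocation from the noise budget.

```python
from dataclasses import dataclass


@dataclass
class ParallelRun:
    aggressor: str         # name of the aggressor network
    length_in: float       # length of the parallel section, inches
    separation_mil: float  # spacing between victim and aggressor, mils


def coupling_mv_per_inch(separation_mil):
    """Hypothetical coupling coefficient: 4 mV/inch at 50-mil spacing."""
    return 4.0 * (50.0 / separation_mil) ** 2


def crosstalk_report(runs, max_separation_mil=200.0, threshold_mv=100.0):
    """Return (total mV, per-aggressor breakdown or None).

    Runs farther away than max_separation_mil are ignored. The breakdown
    is returned only when the total exceeds the threshold.
    """
    per_aggressor = {}
    for run in runs:
        if run.separation_mil > max_separation_mil:
            continue
        mv = coupling_mv_per_inch(run.separation_mil) * run.length_in
        per_aggressor[run.aggressor] = per_aggressor.get(run.aggressor, 0.0) + mv
    total = sum(per_aggressor.values())
    return total, (per_aggressor if total > threshold_mv else None)


if __name__ == "__main__":
    runs = [
        ParallelRun("A", 10.0, 50.0),
        ParallelRun("B", 20.0, 100.0),
        ParallelRun("C", 30.0, 400.0),   # too far away, ignored
        ParallelRun("D", 15.0, 50.0),
    ]
    print(crosstalk_report(runs))
```

Working from coefficients rather than a full transmission line simulation keeps the check cheap enough to run over every victim segment on a board.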

Printed wiring that handles the signals 
between integrated circuits on boards and back- 
planes must be controlled impedance to obtain 
the best system performance. To meet the goals 
of the VAX 8600 system, at each interconnec- 
tion we permitted no more than 100 millivolts 
of reflection due to mismatches in impedance 
as a signal moves from one interconnect to 
another. 
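
The reflection at a junction between two impedances follows from the reflection coefficient, (Z2 - Z1) / (Z2 + Z1). The sketch below applies it to the extremes of the 55 ±5 ohm tolerance. The 800-millivolt signal swing is an assumed nominal ECL value used only for illustration; the 100-millivolt limit is the one stated above.

```python
MISMATCH_LIMIT_MV = 100.0   # per-interconnection limit stated in the text


def reflection_coefficient(z_from, z_to):
    """Fraction of an incident wave reflected at an impedance step."""
    return (z_to - z_from) / (z_to + z_from)


def mismatch_reflection_mv(swing_mv, z_from, z_to):
    """Magnitude of the reflected voltage for a given signal swing."""
    return abs(reflection_coefficient(z_from, z_to)) * swing_mv


if __name__ == "__main__":
    # Worst case within 55 +/- 5 ohms: a 50-ohm section meeting a 60-ohm one.
    reflected = mismatch_reflection_mv(800.0, 50.0, 60.0)  # assumed swing
    print(f"{reflected:.1f} mV, within limit: {reflected <= MISMATCH_LIMIT_MV}")
```

Under these assumptions the worst-case step inside the tolerance band stays below the limit, which suggests why the ±5 ohm tolerance and the 100-millivolt allocation are compatible.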

Summary 

The initial performance goal for the design of 
the 8600 was a program execution speed at 
least four times that of the 11/780. One of the
factors that made possible the realization of this 
goal was an investigation of the interconnect 
environment for the ECL logic used in the 
8600. In doing so we gained a significant 
understanding of and control over the follow- 
ing parameters affecting the integrity of the 
logic signals in the system: 

1. Propagation delay per unit length of line

2. Voltage drops from the source to each 
load 

3. Crosstalk between parallel signal lines 

4. Reflections due to loads on a transmis- 
sion line 

5. Reflections due to mismatched impe- 
dance characteristics of the line 

6. Reflections due to connector impedance 

7. Reflections due to mismatch between 
interconnect impedance and the 
terminator 



This understanding and control allowed us to 
perform accurate simulations of the intercon- 
nect delays through all paths in the CPU, 
resulting in the elimination of a large number 
of potential problems. Accurate timing simula- 
tions of the interconnect allowed the resolution 
of logic delay problems before committing the 
design to hardware, significantly reducing 
design turnaround times. 

Many people inside Digital worked diligently 
to generate programs and build test hardware to 
analyze the interconnect. Because of this, we 
were able to reach our goal of building a system 
of the caliber of the 8600.

Cooling the
VAX 8600 Processor

Proper cooling is essential for reliability yet is constrained by acoustic 
requirements. Both are achieved here using a single centrifugal blower 
to move air through the cabinet, with modules spaced for suitable air 
flow. Thermal models were created to analyze temperature gradients on 
modules and across networks, thus guaranteeing the integrity of signal
levels. Component temperatures received special attention since an MCA 
can dissipate five watts and thus needs a heat sink. The best heat-sink 
design was developed by measuring die temperatures using testing 
devices, each containing a free diode. 



The VAX 8600 processor dissipates six kilowatts
of power, nearly all of it from one
double-width cabinet. Since the functionality 
of the logic is temperature sensitive, cooling 
was a major concern in building a reliable sys- 
tem. Nevertheless, the 8600 runs (and was fully 
qualified) on a solid floor using computer- 
room air for cooling. Of course the system can 
also be cooled by conditioned air drawn 
through a raised floor. Much of our cooling 
design effort was aimed at satisfying acoustic 
goals while at the same time meeting cooling 
requirements. The 8600 is the quietest 
machine of its size that Digital has ever built. 

Overall cooling of the 8600 is accomplished 
by the movement of air from bottom to top. Air 
at normal computer room temperature enters 
the cabinet through a perforated base panel and 
passes through an air-filter assembly that 
doubles as the UL drip screen. Should there 
ever be a fire inside the cabinet, the screen will 
extinguish the flames of any burning material 
that may drip from the equipment. From the 
screen the air passes through the card cage con- 
taining the logic and then through the power 
supplies. At the top of the cabinet is a double-dual
centrifugal blower (i.e., a single device
with a pair of wheels on each side). The blower
pulls the air up through the cabinet and forces 
it out through a pair of acoustic mufflers 
mounted inside the rear cabinet doors. Mounting
the mufflers as an integral part of the rear
doors allows easy access to the logic and power 
backplanes. The mufflers have an expanding 
internal cross section to regain as much static 
pressure as possible from the high-velocity air 
exiting the blowers. The muffler entrance and 
the exhaust louver pattern, respectively, are 
tuned to reduce inlet pressure losses and 
exhaust recirculation. The entire path is closed 
and independent of the outer walls of the cabi- 
net. Opening the cabinet doors does not impair 
the effectiveness of the cooling system. 

The card cage is made up of four sections, as
shown in Figure 1. From left to right, as viewed
from the front, there are the memory, CPU,
adapter bus, and I/O adapters and controllers
that connect to the peripheral equipment. The
memory and I/O sections have standard Digital
0.5-inch slot spacing. Spacing in the adapter
section and some CPU slots is 0.6 inch. The
remaining CPU slots have 1.0-inch spacing to
provide the necessary component clearance
and volume of cooling air flow for those modules
containing macrocell arrays (MCAs). The
greater clearance is required because each MCA
must have an individual heat sink, and the high-powered
MCAs require a greater volume of air
for cooling.

Figure 1 Card Cage

Any VAX 8600 processor may have a number
of empty module slots that can otherwise be
used for various options, such as the floating
point accelerator, additional memory, a second
connection to the adapter bus, and various I/O
options. To prevent the cooling air from taking
the path of least resistance through the gaps,
plastic pseudo-boards are installed in all
unused slots. Eliminating the gaps keeps the air
flow close to the boards where it belongs
(cooling the components) and also serves to make
the air flow characteristics of all machines the
same.

Module Thermal Design 

The thermal design is predicated on two crite- 
ria related to temperature. The first is that sig- 
nal levels in the emitter-coupled logic (ECL) 
components are shifted by changes in tempera- 
ture. Limiting the temperature difference 
between any two components within a network 
to 10°C prevents the logic levels from shifting
out of range at one component relative to
another. The second is that component failures
are proportional to temperature. Holding down
the component junction (die) temperature
yields higher reliability, that is, a longer MTBF.

Early in the project we decided on two gen- 
eral goals to guide us in designing the 8600's 
cooling system to satisfy those criteria. One was 
to maintain a nominal air temperature rise 
across any given module at about 10°C. This 
would guarantee a maximum 10°C ambient dif- 
ference between any two components on the 
module. The other was to guarantee that at least 
90 percent of all die temperatures would be 
less than 100°C, even at the maximum ambient 
temperature of 32°C. Of course there were 
bound to be differences from one module to 
another; one module actually dissipates 180 
watts, and the one next to it dissipates 146.
In those situations where the 10°C rise was
exceeded, we analyzed the individual networks
to determine the temperature gradients within
them. On one board there was actually a 15°C
rise, but no individual network exceeded 10°C;
the goal for junction temperatures was met as 
well. By taking great care in the placement of 
components, we were able to configure the 
individual networks in such a way that even 
though we violated the general rule on temper- 
ature rise in some cases, we nonetheless always 
stayed within the critical limit on the tempera- 
ture difference between two devices wired 
together. 

To help the logic designers, we set up design 
rules aimed at satisfying the thermal require- 
ments. To start, one rule was based on dividing 
a board into sections about two inches wide 
(approximately a single column of compo- 
nents) and three inches high. The rule required 
that the components contained within each 
such section should not exceed a given maxi- 
mum power. The values for maximum power 
and section size were based on preliminary
tests using a mockup board with prototype 
MCAs. With rising air flow, each component 
heats the one above it; we could not therefore 
allow the placement of a column of hot compo- 
nents, even if the rest of the board were cool, 
without evaluating each individual case. For 
example, to make routing possible, a designer 
may have needed to violate the section-power 
rule and put three five-watt devices right on top 
of each other. Cases such as this would be eval- 
uated by considering the network and die tem- 
perature information. 
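
As an illustration, the section-power rule can be sketched in a few lines of Python. The two-by-three-inch section size comes from the text; the 8-watt limit, the component placements, and every name in the code are assumptions made for the example, not values from the 8600 project.

```python
from collections import defaultdict

SECTION_WIDTH_IN = 2.0     # section width from the text (about one component column)
SECTION_HEIGHT_IN = 3.0    # section height from the text
MAX_SECTION_POWER_W = 8.0  # hypothetical limit; the real value came from mockup tests


def section_of(x_in, y_in):
    """Return the (column, row) index of the section containing a point."""
    return int(x_in // SECTION_WIDTH_IN), int(y_in // SECTION_HEIGHT_IN)


def overloaded_sections(components, limit_w=MAX_SECTION_POWER_W):
    """Sum component power per section and return the sections over the limit.

    components: iterable of (name, x_in, y_in, power_w) tuples.
    """
    totals = defaultdict(float)
    for _name, x_in, y_in, power_w in components:
        totals[section_of(x_in, y_in)] += power_w
    return {sec: p for sec, p in totals.items() if p > limit_w}


# Hypothetical placement: three five-watt devices stacked in one column.
board = [
    ("MCA1", 1.0, 0.5, 5.0),
    ("MCA2", 1.0, 1.5, 5.0),
    ("MCA3", 1.0, 3.5, 5.0),
    ("DIP1", 5.0, 0.5, 0.5),
]
violations = overloaded_sections(board)
```

A section flagged this way is not necessarily a design error. As the text explains, such cases were evaluated individually using network and die temperature information.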

At the next level of refinement, we used a 
thermal analysis tool designed by the Thermal 
Engineering Group at Digital in Maynard, Mas- 
sachusetts. This tool utilizes different functions 
of thermal resistance versus air velocity to cal- 
culate junction temperatures for different kinds 
of component packages. Based principally on 
vendor data, these functions were developed 
for plastic packages, ceramic packages, and 
packages with special thermal characteristics. 
Within each package type, separate curves were 
derived for different sizes, correlated to the 
number of pins. To perform the analysis, we 
divided the board into as many as one hundred 
sections. The "model board" was then popu- 
lated for a particular configuration by our spec- 
ifying the components and assigning them to 
the sections. The analyzer first calculates the
temperature rise of the air from environmental
information, power data, and component place- 
ment. From this calculated temperature rise, 
the appropriate thermal functions by compo- 
nent type and size, and air velocity information, 
the program predicts the junction 
temperatures. 
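
The two-step calculation described above (air temperature rise first, junction temperature second) might look roughly like the following sketch. The energy balance for heated air is standard physics, but the air properties are approximate, and the thermal-resistance curve, flow figures, and function names are invented for illustration; they are not the analyzer's actual functions or data.

```python
AIR_DENSITY = 1.2         # kg/m^3, approximate value for room-temperature air
AIR_SPECIFIC_HEAT = 1005  # J/(kg*K), approximate value for air


def air_rise_c(power_w, flow_m3_per_s):
    """Temperature rise of an air stream that absorbs power_w watts."""
    return power_w / (AIR_DENSITY * AIR_SPECIFIC_HEAT * flow_m3_per_s)


def thermal_resistance(velocity_m_per_s):
    """Hypothetical package curve in degC per watt; falls as air speed rises."""
    return 4.0 + 12.0 / (1.0 + velocity_m_per_s)


def junction_temps(column_powers_w, inlet_c, flow_m3_per_s, velocity_m_per_s):
    """Walk one column of sections from bottom to top.

    Air enters at inlet_c and is warmed by every section it passes, so each
    component sees the heat of those below it. As a simplification, each
    junction is computed from the air temperature entering its section.
    """
    temps = []
    air_c = inlet_c
    theta = thermal_resistance(velocity_m_per_s)
    for power_w in column_powers_w:
        temps.append(air_c + power_w * theta)
        air_c += air_rise_c(power_w, flow_m3_per_s)
    return temps


# Three equal 5 W devices in one column (hypothetical numbers).
temps = junction_temps([5.0, 5.0, 5.0], inlet_c=32.0,
                       flow_m3_per_s=0.002, velocity_m_per_s=3.0)
```

Even with identical devices, the junction temperatures climb from the bottom of the column to the top, which is why a column of hot components needed case-by-case evaluation.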

Although the analysis was extremely valua- 
ble, it was also very cumbersome to use. All the 
information on component type, size, power 
and position had to be entered by hand. A per- 
son required nearly a week to enter the data for 
one board. Once the value of the analysis had 
been demonstrated, we modified the program 
to take the component data from files supplied 
by the CAD tools that were already in place 
(drawing program and wirelister). Further- 
more, a set of algorithms and software was also 
developed that performed the section assign- 
ment automatically from layout data provided 
by the component placement optimization 
software. Eventually the handwork was reduced 
to five minutes, the time it took to select the 
number of sections and specify the input files. 

We also developed a network analyzer. Using 
this tool in conjunction with the module ther- 
mal analyzer allowed the inspection of the 
junction temperatures throughout an individual 
network on a board to determine whether the 
10°C rule was violated between any two components
within that network. We used the module
analyzer and the network analyzer on every
board in the VAX 8600 processor. 
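
The check performed by a network analyzer of this kind can be sketched as follows. The 10°C limit is from the text; the data structures, component names, and temperatures are hypothetical.

```python
MAX_NETWORK_DELTA_C = 10.0  # limit from the text


def networks_over_limit(networks, junction_c, limit_c=MAX_NETWORK_DELTA_C):
    """Return {network: spread} for networks whose hottest and coolest
    members differ by more than limit_c.

    networks: {network name: list of component names}
    junction_c: {component name: junction temperature in degC}
    """
    failures = {}
    for net, members in networks.items():
        temps = [junction_c[name] for name in members]
        spread = max(temps) - min(temps)
        if spread > limit_c:
            failures[net] = spread
    return failures


# Hypothetical data: the board spans 15 degC overall, yet only one network fails.
junction_c = {"U1": 70.0, "U2": 76.0, "U3": 85.0, "U4": 82.0}
networks = {"NET_A": ["U1", "U2"], "NET_B": ["U3", "U4"], "NET_C": ["U1", "U3"]}
failures = networks_over_limit(networks, junction_c)
```

In this made-up example, rerouting or re-placing the components of NET_C would bring the board into compliance without lowering its overall temperature rise.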

Once the logic design started to stabilize, we 
expanded the network analyzer to investigate 
individual networks that ran through multiple 
modules across the backplane. With the huge 
number of logic interconnections, this task was 
immense and complicated, but we did manage 
to complete it. Thus in the long run, thermal 
modeling was done at the device and board 
levels, and on the total machine. 

Component Thermal Design

Most of the time and effort in component thermal
design was devoted to the MCA. This was
because each MCA package can dissipate up to
five watts. We tried many approaches relative to
heat sinks and packages, with and without sockets,
before settling on the final designs. We
eventually arrived at a solid socket design for an
MCA in a leadless chip carrier, but the sockets
themselves were finally dropped (in favor of a
pin grid array package) because of insufficient
availability.

The major part of the testing was done to 
determine what heat sink to use based on the 
requirements of die temperature and allowable 
component-to-component temperature differ- 
entials. The die temperature is equal to the 
product of the power and the thermal resis- 
tance for a package plus the ambient tempera- 
ture. The vendor specification for the thermal 
resistance of the MCA is 10 ±2°C per watt. The
LSI circuit is near the surface of the silicon. The 
major thermal path for the package extends 
(a) from the circuit; (b) through the silicon; 
(c) through the die bond, which is a gold- 
silicon eutectic solder; (d) through the alu- 
mina chip carrier; (e) through the epoxy that 
bonds the carrier to the heat sink; and 
(f) through the heat sink into the air. Other
paths to the air include heat convection from 
the surface of the ceramic and conduction 
through the leads into the board. 
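
The die temperature relation stated above can be worked through with the figures quoted in this paper: five watts per MCA, 10 ±2°C per watt, and a 32°C maximum ambient. The sketch below is only arithmetic on those numbers. It assumes that the vendor figure applies to the package as used and that the air at the component is at ambient temperature, neither of which the text confirms.

```python
def die_temperature_c(power_w, theta_c_per_w, ambient_c):
    """Die temperature = power x thermal resistance + ambient temperature."""
    return power_w * theta_c_per_w + ambient_c


POWER_W = 5.0     # maximum MCA dissipation quoted in the text
AMBIENT_C = 32.0  # maximum ambient temperature quoted in the text

# Vendor specification: 10 degC/W nominal, plus or minus 2 degC/W.
best = die_temperature_c(POWER_W, 8.0, AMBIENT_C)
nominal = die_temperature_c(POWER_W, 10.0, AMBIENT_C)
worst = die_temperature_c(POWER_W, 12.0, AMBIENT_C)
```

Under these assumptions the tolerance on thermal resistance alone moves the die temperature by 20°C, which suggests why direct measurement of the die was considered necessary.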

Within these constraints, we had to select the 
heat sink and the epoxy to attach it. But in 
order to make these decisions, some way of 
actually measuring the die temperature was 
needed. The most promising technique seemed 
to be the traditional one of using a free diode as 
an internal thermometer. With constant cur- 
rent, a diode has a negative voltage/ 
temperature curve that is linear over small 
ranges. Since the ranges of concern were small, 
if there was a free diode on the device, we 
could calibrate it in a bath and then use its 
leads to monitor the die temperature. 
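
A bath calibration of this kind amounts to fitting a straight line to voltage and temperature pairs and then inverting it. The sketch below assumes a made-up diode with a slope of about -2 millivolts per degree; the calibration points and names are illustrative, not measured data.

```python
def fit_line(xs, ys):
    """Least-squares straight line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical bath calibration at constant current: (temperature degC, volts).
calibration = [(25.0, 0.700), (50.0, 0.650), (75.0, 0.600), (100.0, 0.550)]
bath_temps = [t for t, _v in calibration]
bath_volts = [v for _t, v in calibration]

# Fit temperature as a function of voltage so readings convert directly.
slope_c_per_v, offset_c = fit_line(bath_volts, bath_temps)


def die_temperature_from_diode(diode_volts):
    """Convert a diode voltage reading to die temperature in degC."""
    return slope_c_per_v * diode_volts + offset_c
```

The fit is trustworthy only inside the calibrated range, which matches the observation that the curve is linear over small ranges.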

Unfortunately, an ECL device under power
does not have any free diodes, so the vendor 
produced a special die just for thermal testing. 
This die was somewhat different physically 
from the MCA die, and it contained only TTL 
circuitry for making thermal measurements. 
Digital and the other companies using MCAs 
worked together to calibrate the die and use it 
to measure temperatures. At first we had con- 
siderable problems with instrumentation, learn- 
ing what to do and how to do it, and getting 
good dies from which reliable measurements 
could be gained. We built our own test equip- 
ment and developed procedures that allowed 
us to "look inside" the MCA packages. The suc- 
cess of this effort enabled us to select heat sinks 
that maintain the MCA temperatures at the 
desired levels. 



We decided to continue our testing on a 
device that better approximated the MCA in 
both size and structure. For this purpose, 
Digital's LSI facility in Hudson, Massachusetts,
provided two types of thermal test elements.
The first incorporated the TTL die of a gate
array used in the VAX-11/750 system. This
device allowed us access to a free diode and 
was close to the right size for the MCA. It was 
mounted in the ceramic carrier of the MCA and 
allowed us to get a close thermal approxima- 
tion to an 8600 MCA package. Later the Hudson 
plant created an actual MCA on which they 
placed a "free" non-ECL diode just for thermal 
testing. The diode is not used in the logic of the 
device, and in normal production, it is not 
bonded to the I/O pads of the chip carrier. 
Whenever packages are required for thermal 
testing, the diode leads are bonded in place of 
two of the MCA output connections. This pro- 
cess renders the package useless for any other 
function, but perfect for thermal testing, since 
it is the actual structure of interest, an MCA
die, and dissipates the actual power of the
devices used in the 8600. With this "real" MCA 
package, we verified our thermal design by 
building a module with these parts in place of 
the actual MCAs. This "thermal module" can be 
placed in a machine and powered as if it were 
actually functional. The MCA packages contain- 
ing the special die can be monitored, allowing 
us to watch what really happens inside the 
machine. 

The experiments with the test devices also 
enabled us to investigate die bonding, or wet- 
ting. We wanted to know how much of the 
piece of silicon was actually soldered to the 
ceramic. The result of these studies enabled us 
to establish the specification for a test proce- 
dure that inspects the temperature of the die 
after it has been powered for a specified num- 
ber of seconds. If the die bond is poor, the heat 
will have to travel through a small void rather 
than through the higher conductivity solder; 
the die temperature will therefore be higher 
than a specified acceptable level. 

Switch to the Pin Grid Array Package

Fairly late in the project, an acceptable pin grid
array (PGA) package became available, and we
decided that its advantages warranted using it.
This meant that all the thermal investigations
had to be repeated to verify that the PGA configuration
met the goals.

Removing the sockets shortens the packages, 
so there is more space for air flow between 
those boards that already have the larger, one-inch
spacing. It was feared that the new package
configurations might actually run much
cooler. We already had a fairly solid logic
design that worked in the thermal configuration 
then existing; a significant temperature shift in 
either direction was undesirable. Running hot- 
ter reduces reliability; running significantly 
cooler, although it improves reliability, might 
affect the signal levels to such an extent that 
the system would not work at all. 

We studied the temperatures with the ther- 
mal module in every slot. Then we experi- 
mented extensively with a particular slot that 
was warmer than the others (the air flow is not 
exactly the same through all slots). The result 
of the investigation is a package in which the 
MCA runs slightly cooler than before, but still 
well within the signal level requirements. The 
heat sink is a single, four-finned unit, one inch 
in diameter. It is bonded to the top of the PGA 
package with an epoxy, and the whole assembly 
process is fully automated. 

Summary 

To cool a machine as large and as dense as the 
VAX 8600 processor requires the continuous 
movement of a very large volume of air. To do it 
with air at room temperature and go about it 
quietly is a significant feat indeed. It was 
accomplished by exercising meticulous care in 
the physical configuration of the system and by 
the creation of imaginative and thorough tools 
for thermal analysis.

Designing Reliability
into the VAX 8600 System 

The failure rate of a system is directly related to the number of compo- 
nents used in its design. Therefore, the designers of a large CPU must put 
emphasis on fault avoidance, fault tolerance, and fault minimization to 
ensure that the overall system failure rate is acceptable. The VAX 8600 
system contains many features to assure its reliability. Conventional 
approaches, like parity checking, and nonconventional ones, like array 
address checking through ECC codes, were used to overcome the higher 
failure rate generated by having more components. This paper covers 
the most important steps taken to provide that reliability. 



The cost of a failure is proportional to the size 
of a system, since more compute power is lost 
and more people are idled as size increases.
Since the failure rate is directly related to the 
number of components in the system, a much 
greater emphasis must be placed on fault-toler- 
ant designs in larger systems in order to keep 
the costs of failures at an acceptable level. 1 The 
VAX 8600 system is the largest, most powerful 
computer produced by Digital Equipment Cor- 
poration. We made customer satisfaction the 
most important engineering goal, thereby plac- 
ing a high priority on the machine's reliability. 
In this paper, reliability is discussed from the 
customer's point of view, which covers a wider 
context than the usual definition of inherent 
reliability. 

Computer reliability enhancement can be 
subdivided into four areas: fault avoidance, 
fault tolerance, fault minimization, and 
improved mean time to repair (MTTR). Fault 
avoidance is realized by reducing the system 
failure rate through improved quality of the 
components, interconnects, design, and manu- 
facturing. Fault tolerance is the negation of the 
effects of faults through correction codes, 
redundant hardware, reconfiguration, and 
retry. 2 Fault minimization is the reduction of
the effects of a fault by tagging corrupted data
that has damaged the machine state or other 
data. Furthermore, fault minimization can be 
achieved by having the hardware give accurate 
and detailed fault information. The MTTR is 
improved through remote diagnosis, the reduc- 
tion of the time to diagnose a fault, and the 
increase of diagnostic accuracy. The applica- 
tion of each of these four areas to the VAX 8600 
design is discussed in detail in the following 
paragraphs. 

Before these details are presented, however, a 
short explanation of the major parts of the 8600 
architecture is warranted. The components in 
the VAX 8600 CPU are contained in four 
"boxes" that control operations and perform
various functions. The E Box executes and 
retires instructions. The I Box prefetches and 
decodes instructions and prefetches operands. 
The M Box performs page translation, cache 
functions, I/O transfers, and memory array 
access. And the F Box performs floating point 
operations. 

Fault Avoidance 

Our first goal in designing a reliable system was 
to reduce the number of failures that occur in 
the machine. This involved getting
components, interconnects, and power systems
with the lowest failure rates. Reducing the fail- 
ure rates also involved constantly monitoring 
the failures that were experienced and deter- 
mining their causes. 

A major influence on the IC reliability was 
exercised by specifying how the chips were to 
be stressed and tested. The DIPs and the 
macrocell arrays (MCAs) were required to be 
burned in before testing; thereafter, all chips 
were to be functionally tested. However, in 
debugging the early machines we discovered 
bad DIPs. We had expected to find only a hand- 
ful of bad chips since they were all burned in. 
To identify the cause of these failures, all defec- 
tive chips were analyzed. The problem was 
identified as static that was "zapping" our modules.
Subsequently, the design was changed so
that all machines come with static grounding 
straps. 

We also examined the designs of previous 
CPUs to determine which problem areas were 
typical. The backplane is an example. Wire- 
wrapped backplanes are difficult to build and 
test. They have several failure modes, such as
cold flow of the insulation, a nicked wire, and 
scraps of wire. They can also be damaged dur- 
ing servicing of the machine. All these 
problems often result in intermittent faults that 
slowly but surely become more solid. Improv- 
ing the quality control on the wire-wrapping 
process to obtain the desired reliability was a 
very difficult task, since the process is com- 
prised of a large number of repetitive but not 
identical operations. Moreover, a very small 
error rate still produces quite a large overall 
failure rate. Therefore, early in the project, we 
decided to replace the wire-wrapped backplane 
with a multilayer printed circuit card, which 
has a much lower failure rate. 

In the power subsystem, fault avoidance was 
pursued by improving the alternating current 
(ac) input-power tolerance, the design testing, 
the manufacturing processes, and the environ- 
mental monitoring. In particular, manufactur- 
ing was a key area where the reliability of the 
power supplies was improved. A new power- 
supply tester was developed to improve our 
testing capabilities. It contains logic that can 
fully test the characteristics of a power supply 
and store the test data. The data includes line 
and load regulation and noise measurements. 

A modular power supply (MPS) was designed
to run from a single clock so that all regulators
would be in synchronization. This synchronization
allowed us to predict and control the output
noise of the switching regulators. A new
high-current connector that allows the regulators
to be pluggable was also developed.

The power subsystem also contains the envi- 
ronmental monitoring module (EMM). The 
EMM was designed to monitor the status of the 
power supply and the environment inside the 
system. The EMM can measure the voltage out- 
put of every regulator, the inlet and outlet air 
temperatures, the air-flow velocity, and the 
ground-wire current in the primary power cord. 
The system protects itself by having the EMM 
monitor these conditions, log any deviations, 
and shut down the system if adverse conditions 
warrant it. 
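
The monitor, log, and shut-down behavior described for the EMM can be modeled in outline as below. The sensor names, limits, and shutdown policy are all hypothetical; the text does not give the EMM's actual thresholds or which conditions force a shutdown.

```python
from dataclasses import dataclass


@dataclass
class Limit:
    low: float
    high: float
    shutdown: bool  # True if a deviation should shut the system down


# Hypothetical limits; the real EMM thresholds are not given in the text.
LIMITS = {
    "regulator_volts": Limit(4.75, 5.25, shutdown=True),
    "inlet_air_c": Limit(10.0, 32.0, shutdown=False),
    "outlet_air_c": Limit(10.0, 45.0, shutdown=True),
    "air_flow_m_per_s": Limit(2.0, 10.0, shutdown=True),
    "ground_current_a": Limit(0.0, 0.5, shutdown=True),
}


def check_environment(readings, limits=LIMITS):
    """Return (log, shutdown) for one set of sensor readings."""
    log = []
    shutdown = False
    for name, value in readings.items():
        limit = limits[name]
        if not limit.low <= value <= limit.high:
            log.append(f"{name} out of range: {value}")
            shutdown = shutdown or limit.shutdown
    return log, shutdown


# A warm inlet is logged but, under the assumed policy, does not stop the system.
log, shutdown = check_environment({
    "regulator_volts": 5.02,
    "inlet_air_c": 33.5,
    "outlet_air_c": 41.0,
    "air_flow_m_per_s": 3.1,
    "ground_current_a": 0.0,
})
```

Separating the logging decision from the shutdown decision, as in this sketch, lets minor deviations leave a record without interrupting service.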

According to E.J. McCluskey, "Improper 
design of the hardware or software can result in 
a system which does not function at all. Such 
mistakes are, of course, quickly discovered and 
corrected. Other, less obvious design defects 
usually remain in any system even after it has 
been in service for a long time."* The results of
design problems are logic circuits that either 
fail prematurely or sense signals falsely. The 
number of these types of errors is indirectly a 
measure of the quality of the tools used in the 
system's design. 

At the beginning of a design project, rules are 
established to make sure that the goals for sig- 
nal integrity and component failure rates can be 
achieved. It is usually impossible to develop 
rules that are both easy to check and at the same 
time don't overly constrain the design engineer. 
Often this results in complex rules. If they are 
inadvertently broken, the usual outcome is a 
decrease in the machine's reliability. The bro- 
ken rules result in components that operate 
with excessive temperatures or signals that do 
not have adequate noise margins. A chip that 
runs too hot will fail sooner than anticipated; a 
signal that doesn't have adequate noise margin 
will sometimes be sensed incorrectly. Worse 
still is the fact that the component is blamed 
rather than the true cause, a violated rule. 

As an example consider the operating tem- 
perature of an IC. There is a tradeoff between 
the maximum and minimum operating temper- 
atures and the amount of noise margin availa- 
ble. If the temperature of an IC exceeds its 
maximum specified temperature, the amount of 
noise normally present from known sources, 
such as adjacent-run crosstalk, may be
sufficient to produce a false signal. Therefore,
it is important that all ICs stay within their 
specified operating temperatures. To ensure 
that, we developed a tool for use on the 8600 
to check for chips that were getting too hot. If a 
chip was detected as being too hot, its layout 
was modified to correct the problem without 
changing the total power of the module. 

A new timing analysis tool was also devel- 
oped for the project. This tool enabled the 
designers to do a much more thorough job of 
timing analysis on this machine than had been 
done on previous projects. Using it involved 
running many separate programs that built a 
timing model of the machine from the schemat- 
ics and the layouts of the modules, backplane, 
and MCAs. The results of the model were then 
used by a program that performed timing analy- 
sis of the design based upon a set of interbox 
timing specifications. 

After the layouts of the modules were com- 
pleted, every single run was analyzed to ensure 
that signal integrity had been achieved. The 
program computed the amount of noise gener- 
ated from adjacent runs, reflections, and the 
like. Based on these results, we made a number 
of reroutings to increase the integrity of certain 
signals. 

Fault Tolerance 

All the efforts discussed in the previous section 
improved the machine's reliability. However, 
the logic could still fail; therefore, it was 
important to have mechanisms to recover from 
a logic fault whenever possible. Fault isolation 
and fault tolerance are highly correlated, not 
separate issues. Data integrity and retry opera- 
tions depend on good fault detection. So does 
the ability to reconfigure the system when a 
fault occurs, a situation that requires accurate 
fault isolation as well. ' It is important to know 
what type of fault was made and what processes 
may or may not have been affected by it. To 
accomplish fault isolation, we had to develop 
an effective fault detection and reporting 
scheme. 

The design philosophy for the fault system 
had several major concepts. The first was that 
faults occurring synchronously with the pro- 
gram counter (PC) should be reported synchro- 
nously to it. Synchronous faults have a direct 
relationship to the current value of the program 
counter. For example, consider a write to an 
I/O register. Only one cycle is required for the
M Box to accept all the information to perform
the write operation. In the meantime, the E Box 
could continue processing instructions. The 
problem here is that if the I/O write has a fault, 
the current PC of the machine would have no 
fixed relationship to that fault, thus making 
recovery more difficult. To solve this problem,
the microcode will stall the E Box on an I/O 
write until the confirmation of that write is 
received. 

A similar problem exists with a translation 
buffer (TB) miss on a prefetch for the instruc- 
tion buffer. If a branch is ahead of the TB miss 
in the instruction buffer and the branch is 
taken, the TB miss will not be a problem and 
should not be reported. In this case the design 
requires a delay in sending the TB miss signal to 
the E Box (which performs the memory man- 
agement operations) until it attempts to exe- 
cute the instruction whose prefetching caused 
the TB miss. In general, synchronous faults are 
reported via E Box microtraps. 

Faults that are asynchronous to the program 
counter are reported asynchronously. Asynchronous
faults are ones to which the value of the
program counter has no definite relationship;
they are usually reported through interrupts.
Two examples of asynchronous faults
are a fault occurring on a disk write to memory
and a parity error on a cache writeback 
operation. 

At the time a fault is detected, it may not be 
known whether the fault should be reported 
synchronously or asynchronously. In that case, 
both fault-logging mechanisms are invoked: a 
microtrap for synchronous faults and an inter- 
rupt for asynchronous ones. Consider the case 
of a parity error on an instruction prefetch. If 
the E Box executes a branch prior to using the 
bad data, the synchronization will never be 
reached and the fault will be logged through an 
interrupt. In this case the microtrap condition 
will be cleared by the execution of the branch. 
If, however, the E Box attempts to execute the 
prefetched instruction with the parity error, an 
E Box microtrap will occur and the trap routine 
will clear the interrupt. 
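
The interplay of the two reporting mechanisms for a prefetch parity error can be captured in a small state model. This is a simplified sketch of the behavior described above, not the 8600's logic: the class and method names are invented, and it assumes the interrupt is serviced only after the branch or the execution attempt has occurred.

```python
class PrefetchFault:
    """Parity error on a prefetched instruction, pending in both mechanisms."""

    def __init__(self):
        self.microtrap_pending = True  # synchronous path
        self.interrupt_pending = True  # asynchronous path
        self.logged_by = None

    def branch_taken(self):
        """The bad instruction is discarded: only the interrupt remains."""
        self.microtrap_pending = False

    def execute_bad_instruction(self):
        """The E Box reaches the bad data: the microtrap fires and its
        routine clears the interrupt."""
        if self.microtrap_pending:
            self.logged_by = "microtrap"
            self.microtrap_pending = False
            self.interrupt_pending = False

    def service_interrupt(self):
        """Log the fault asynchronously if it is still pending."""
        if self.interrupt_pending:
            self.logged_by = "interrupt"
            self.interrupt_pending = False


# Case 1: a branch skips the bad instruction, so the interrupt logs the fault.
skipped = PrefetchFault()
skipped.branch_taken()
skipped.service_interrupt()

# Case 2: the E Box tries to execute the bad instruction, so the microtrap logs it.
used = PrefetchFault()
used.execute_bad_instruction()
used.service_interrupt()
```

In both cases the fault is logged exactly once, which is the point of arming both mechanisms and letting each path cancel the other.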

The second major concept used throughout 
the design was that hardware faults are consid- 
ered to be process faults only if a process 
attempts to use or store corrupted data. For 
example, if corrupted data is detected during a 
writeback to memory from the cache, a fault 
will be logged. However, the process will not
experience a fault until it attempts to either
consume the corrupted data or store it on a 
disk. This logic imposes the requirement that 
corrupted data be marked for later detection, 
which is done with ECC code in memory. This 
subject is discussed in the Unique Reliability 
Features section. 

Fault Minimization 

When recovery is not possible, the next best 
thing is to control the amount of damage done 
by a fault. This tactic requires fault information 
that is accurate, relevant, and sufficient. When- 
ever a fault occurs, an error stack frame will be 
constructed by the E Box and placed in mem- 
ory. The stack frame format is the same for all 
errors. We did not prejudge what would be use- 
ful in determining which information was 
relevant. 

In the case of damaged data, fault reporting 
alone is not sufficient, since it is not possible to 
determine which process will access that data. 
Therefore, when data damage occurs, the logic 
marks it as "bad," and any future user of that 
data will be notified of that fact. 

Mean Time to Repair 

There are two kinds of machine failures: those 
having solid fault symptoms, and those having 
intermittent fault symptoms. Of the two, solid 
faults are easier to diagnose. To isolate solid 
faults, the console can examine the state of the 
signals that go from one module to another. 
Diagnostics are run to find the first failed test, 
which is then run in a single-step manner to 
look for the first incorrect signal. With the 
exception of multiple-source signals, the 
source of the first incorrect signal value is the 
failing module (since all of its inputs have been 
checked by this process). In this way faults can 
be isolated to the field replaceable unit. 
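
The isolation procedure for solid faults reduces to a search for the first signal that differs from its expected value. The sketch below assumes that expected values are available for every single step and that each signal maps to one driving module; the signal and module names are hypothetical.

```python
def isolate_fault(observed_steps, expected_steps, signal_source, multi_source=()):
    """Return (step, signal, module) for the first incorrect signal, or None.

    observed_steps / expected_steps: one {signal: value} dict per single step.
    signal_source: {signal: name of the module that drives it}.
    multi_source: signals with several drivers, which cannot be attributed
    to one module; for these the module is returned as None.
    """
    for step, (observed, expected) in enumerate(zip(observed_steps, expected_steps)):
        for signal in sorted(expected):
            if observed[signal] != expected[signal]:
                if signal in multi_source:
                    return step, signal, None
                return step, signal, signal_source[signal]
    return None


# Hypothetical trace of the first failed test, single-stepped.
expected = [{"A": 0, "B": 1}, {"A": 1, "B": 1}, {"A": 1, "B": 0}]
observed = [{"A": 0, "B": 1}, {"A": 1, "B": 0}, {"A": 0, "B": 0}]
source = {"A": "module_3", "B": "module_7"}
result = isolate_fault(observed, expected, source)
```

Stopping at the first mismatch matters: the later error on signal A in this example may only be a consequence of the earlier fault on B.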

Intermittent faults are much more difficult to 
diagnose, and they comprise between 80 per- 
cent and 90 percent of the faults. Diagnostics 
rarely provoke intermittent faults. But even 
when they do, the fault reporting can often be 
confusing. This confusion occurs because a 
logic fault will usually take place in a circuit 
after it has been tested and while another cir- 
cuit is being tested. The number of fault 
checkers in a machine affects its ability to know
that a fault has occurred and to identify the 
failing unit. The probability of a fault occurring 
in the logic that any given checker has checked
is not affected by whether the result is used or
not. If an intermittent fault occurs on a path 
that is not being used, then no real fault has 
occurred. Therefore, the machine's overall reli- 
ability is increased by ensuring that fault check- 
ing is performed only on networks that are actu- 
ally being used. 

A detailed list of the checkers included in the
VAX 8600 system appears at the end of the
paper.

If a failure occurs that requires immediate 
power shutdown, then remote diagnosis 
through the console cannot be used. This 
occurs when the regulators detect an overheat- 
ing condition or the power for the EMM is out 
of tolerance. In these cases a magnetic indicator 
code that contains the failing regulator number 
will be displayed on the EMM module. This 
code enables a field service technician to know 
which regulator to replace. 

Unique Reliability Features in the 
VAX 8600 CPU 

In addition to the reliability features already 
discussed, the VAX 8600 design includes some 
not previously found on other Digital 
machines. These features are discussed under 
the four major areas used in the first part of this 
paper. 

Fault Avoidance 

The F Box executes self-diagnostics when it is 
not performing floating point instructions. 
These tests use "live" operands to enhance the 
detection of data-dependent faults. Both the 
E Box and the F Box are connected to a common 
source of instructions and operands. When the 
F Box detects that it cannot perform an opera- 
tion, it will execute a diagnostic self-test. 
Exactly which self-test is performed depends 
upon the instruction. The number of machine 
cycles in the diagnostic routine is chosen to be 
equal to or less than the number of machine 
cycles used by the E Box. This ensures that the 
F Box will always be ready for the next floating 
point operation that will be passed to it. If a 
fault is detected, the F Box will be turned off, 
and the E Box will perform the instruction that 
would have been done by the F Box, only at a 
much slower speed. 

Fault Tolerance 

The 8600 supports instruction retry where possible.
If a fault occurs that causes a microtrap
during an instruction, a set of instruction retry
flags will be passed along through the various 
fault recovery stages. The flags indicate 
whether or not the CPU has performed an oper- 
ation that would make restarting the instruction 
impossible. An instruction retry would be 
inhibited if an abort bit is "on" for (a) an I/O 
read, (b) a memory write, (c) a state modify, or 
(d) the E Box. Otherwise, the instruction can 
be restarted. 
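
The retry decision is a simple test on the four abort bits. A minimal sketch, with invented names and bit assignments (the actual encoding of the retry flags is not described in the text):

```python
from enum import Flag


class Abort(Flag):
    """Retry-inhibiting conditions named in the text; bit values are arbitrary."""
    NONE = 0
    IO_READ = 1
    MEMORY_WRITE = 2
    STATE_MODIFY = 4
    E_BOX = 8


def can_retry(flags):
    """An instruction can be restarted only if no abort bit is on."""
    return flags == Abort.NONE


# A fault after a memory write cannot be retried; one before any side effect can.
after_write = can_retry(Abort.MEMORY_WRITE)
before_side_effects = can_retry(Abort.NONE)
```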

The data cache can recover from single-bit 
errors. A cache data entry consists of 32 bits of 
data, 4 bits of byte parity, and 7 bits of ECC. 
The write of the check bits is pipelined and 
occurs in the cycle following the write of the 
data. The parity bits are used for fault detection 
and the ECC bits for error correction. The 
M Box always passes data to the E Box or I Box 
before any checking is done. If the data con- 
tains a parity error, then either the E Box or the 
I Box, as well as the M Box, will detect it. The 
M Box will then block the acceptance of any 
more requests and will execute a data correc- 
tion sequence. The ECC code and the data are 
then sent to the array bus, and normal array-to- 
M Box data correction is applied. The "cor- 
rected word" is then written back into the 
cache. At some point the E Box will discover 
that it has been shipped bad data. The system 
will then retry the instruction if possible. The 
retry will be successful if the original fault was 
correctable. 
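
The detection half of this scheme, four parity bits guarding the four bytes of a cache word, can be illustrated as follows. Even parity and the bit ordering are assumptions, since the text states neither; the ECC correction step is not modeled.

```python
def byte_parity_bits(word):
    """Return 4 parity bits, one per byte of a 32-bit word (bit i covers byte i).

    Even parity is assumed here; the text does not state the polarity.
    """
    bits = 0
    for i in range(4):
        byte = (word >> (8 * i)) & 0xFF
        bits |= (bin(byte).count("1") & 1) << i
    return bits


def bad_bytes(word, stored_parity):
    """Return the indexes of bytes whose parity does not match."""
    diff = byte_parity_bits(word) ^ stored_parity
    return [i for i in range(4) if (diff >> i) & 1]


word = 0x12345678
parity = byte_parity_bits(word)
corrupted = word ^ (1 << 13)  # flip one bit in byte 1
```

Parity identifies the damaged byte but not the damaged bit, which is why the correction sequence falls back on the ECC bits stored with the entry.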

An important goal of the power subsystem is 
to increase its tolerance of bad ac input power.
The power input is a true three-phase input
with very low neutral current. In previous 
designs the power-storage capacitors had been 
attached to the regulator outputs. The detec- 
tion of power failures was performed by moni- 
toring the ac line. In contrast, the VAX 8600 
power system first converts power to 300 Vdc 
and then sends that power to regulators in order 
to produce the final output voltages. Power 
storage is done at the 300 Vdc level. This
higher voltage allows more energy to be stored,
since the storage is provided by capacitors, whose
stored energy grows with the square of the voltage.
Power-failure detection is performed by moni- 
toring the voltage level on the 300 Vdc power 
supply. When its voltage falls to the level at
which there is just enough energy remaining to
perform a power-fail sequence, an ac
power failure is declared. This method
allows continued operation regardless of the ac 
input waveform, as long as the machine 
receives sufficient energy, a fact that is espe- 
cially helpful during brownout conditions. 
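
The reasoning behind the voltage threshold can be made concrete with the capacitor energy formula E = 1/2 C V^2. The sketch below is not taken from the 8600 design: the capacitance, load, sequence time, and minimum usable bus voltage are all hypothetical values chosen only to show the arithmetic.

```python
import math


def stored_energy(capacitance_f: float, volts: float) -> float:
    """Energy in joules held by a capacitor: E = 1/2 * C * V^2."""
    return 0.5 * capacitance_f * volts ** 2


def power_fail_threshold(capacitance_f: float, v_min: float,
                         load_w: float, sequence_s: float) -> float:
    """Bus voltage at which the energy above v_min just covers the power-fail sequence."""
    needed_j = load_w * sequence_s
    return math.sqrt(v_min ** 2 + 2.0 * needed_j / capacitance_f)


# Hypothetical numbers; the source gives no component values.
C = 0.01        # farads of bulk storage
V_MIN = 200.0   # assumed lowest bus voltage the regulators can run from
threshold = power_fail_threshold(C, V_MIN, load_w=5000.0, sequence_s=0.004)

print(round(threshold, 2))                                # declare failure at this voltage
print(stored_energy(C, 300.0) / stored_energy(C, 5.0))    # 300 V versus 5 V storage
```

Because the stored energy depends on the square of the voltage, the same capacitance holds 3600 times more energy at 300 V than at 5 V, which is the advantage of storing energy ahead of the regulators.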

Fault Minimization 

The 8600 makes good use of the unassigned
ECC codes (a 7-bit ECC can protect up to 57
bits of data, of which each word uses only 32). They are used to detect array
addressing problems and to flag any corrupted 
data. When a memory write occurs (see 
Figure 1), the parity of the address and an indi- 
cation of the quality of data are sent to the ECC 
generator. The quality of data is good if no 
faults were detected during its transmission to 
the M Box and bad if the machine suspects that 



[Figure 1 is a block diagram. Labeled elements: Data To Be Written, Data Address, Data Bits 0:31, Parity Generator, Address Bits 4:31, Address Parity, ECC Generator (bits 31, 32, 33), Quality Of Data, Syndrome, Data Bits 32:38, Array Data Store, Array Address.]

Figure 1 Array Address Checking in the VAX 8600 System
a fault is present. The address parity and quality
information are inserted into the ECC generator 
by means of bits 32 and 33 of the data. Neither 
of these bits is stored in the array. When the 
data is read back, the computed address parity 
is sent along with a good-data signal to the ECC 
generator. If the computed syndrome is zero, 
the transaction is considered to be good. If the 
ECC generator decodes a single-bit error pointing
to the address parity bit, then an address parity
error will be declared. When that occurs, the 
word that was just received did not come from 
the address that it should have. Thus, the ECC 
generator can check the address lines from the 
M Box to the MOS array chips and detect the 
control faults that caused the M Box to access 
the wrong data word. If the ECC generator indicates that the
quality bit needs correction, then the data word
was faulty when it was received. The requester
of this data will then be notified that the data is 
bad. If a normal single-bit error occurs on a data 
word that was stored with a code indicating bad 
quality, then the M Box will flag an ECC
double-bit error. 
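
The scheme in this section can be sketched with a textbook single-error-correcting, double-error-detecting Hamming code. The source does not say which code the 8600 uses, so the code construction, the bit positions, and all function names below are assumptions. Six Hamming check bits can cover 63 - 6 = 57 information bits; the sketch uses 34 of them: 32 data bits plus two "phantom" bits (address parity and data quality) that enter the check-bit calculation but are never stored. Address parity is taken over the whole address here, which is also an assumption.

```python
DATA_BITS = 32
ADDR_PARITY = 32  # phantom information bit 32: address parity (never stored)
QUALITY = 33      # phantom information bit 33: 1 means the data was already suspect

# Hamming positions 1..40 that are not powers of two carry the 34 information bits.
POSITIONS = [p for p in range(1, 41) if p & (p - 1)]


def parity(x: int) -> int:
    return bin(x).count("1") & 1


def hamming_bits(info: int) -> int:
    """XOR of the positions of all set information bits (the 6 check bits)."""
    h = 0
    for i, pos in enumerate(POSITIONS):
        if (info >> i) & 1:
            h ^= pos
    return h


def write_word(data: int, addr: int, quality_bad: bool = False):
    """Return what the array stores: 32 data bits and 7 check bits."""
    info = data | (parity(addr) << ADDR_PARITY) | (int(quality_bad) << QUALITY)
    h = hamming_bits(info)
    overall = parity(info) ^ parity(h)  # makes the total codeword parity even
    return data, h | (overall << 6)


def read_word(data: int, ecc: int, addr: int):
    """Check a stored word against the address it is being read from."""
    info = data | (parity(addr) << ADDR_PARITY)  # quality is presented as good (0)
    stored_h, stored_overall = ecc & 0x3F, (ecc >> 6) & 1
    syndrome = hamming_bits(info) ^ stored_h
    odd = parity(info) ^ parity(stored_h) ^ stored_overall
    if syndrome == 0 and not odd:
        return "good", data
    if not odd:
        return "double-bit error", data
    if syndrome == POSITIONS[ADDR_PARITY]:
        return "address parity error", data
    if syndrome == POSITIONS[QUALITY]:
        return "bad data", data
    if syndrome in POSITIONS[:DATA_BITS]:
        return "corrected", data ^ (1 << POSITIONS.index(syndrome))
    if syndrome & (syndrome - 1) == 0:
        return "corrected", data  # a check bit flipped; the data is intact
    return "uncorrectable", data


data, ecc = write_word(0xDEADBEEF, 0x1000)
print(read_word(data, ecc, 0x1000)[0])             # same address
print(read_word(data, ecc, 0x1001)[0])             # address with different parity
print(read_word(data ^ (1 << 5), ecc, 0x1000)[0])  # one flipped data bit
```

Two properties of the text fall out of this construction. A word written with the quality bit set reads back as a single-bit "error" at the quality position, and a further single-bit fault in such a word looks like two errors, so it is reported as a double-bit error. The sketch also shows a limit of the approach: a wrong address is caught only when its parity differs from that of the intended address.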

Most of the internal buses in the VAX 8600
CPU, as well as the shifter and the arithmetic
logic units (ALUs), are parity checked. The ALUs
are checked by triplication and by parity checking
the results. The I Box, F Box, and E Box each
contain a set of general purpose registers 
(GPRs). When a write to a GPR occurs, all
copies are written simultaneously, thus keeping
them consistent. If a GPR parity error is
detected in one box, a recovery will be initi- 
ated that copies correct data from the 
equivalent GPR in another box to the failed 
GPR. Thus the machine can recover from GPR 
parity errors. 
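
The GPR recovery idea can be sketched as three register-file copies that are always written together and repaired from one another on a parity failure. The class names, the single parity bit per register, and the choice of donor copy are assumptions made for illustration; the source does not describe the recovery sequence at this level.

```python
def parity(x: int) -> int:
    return bin(x).count("1") & 1


class GprCopy:
    """One box's copy of the 16 VAX general purpose registers, one parity bit each."""

    def __init__(self):
        self.value = [0] * 16
        self.par = [0] * 16

    def write(self, reg: int, value: int) -> None:
        self.value[reg] = value & 0xFFFFFFFF
        self.par[reg] = parity(self.value[reg])

    def ok(self, reg: int) -> bool:
        return parity(self.value[reg]) == self.par[reg]


class GprSet:
    """Copies held by the I Box, F Box, and E Box."""

    def __init__(self):
        self.copies = {"I": GprCopy(), "F": GprCopy(), "E": GprCopy()}

    def write(self, reg: int, value: int) -> None:
        # Every write goes to all copies, which keeps them consistent.
        for copy in self.copies.values():
            copy.write(reg, value)

    def read(self, box: str, reg: int) -> int:
        copy = self.copies[box]
        if not copy.ok(reg):
            # Recover from any other copy that still passes its parity check.
            # Raises StopIteration if no good copy exists (not handled here).
            donor = next(c for c in self.copies.values()
                         if c is not copy and c.ok(reg))
            copy.write(reg, donor.value[reg])
        return copy.value[reg]


gprs = GprSet()
gprs.write(3, 0x1234)
gprs.copies["E"].value[3] ^= 1   # inject a single-bit fault in the E Box copy
print(hex(gprs.read("E", 3)))    # the read repairs the copy from another box
```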

Mean Time to Repair 

The number of microsequencers in the
VAX 8600 system also adds to its reliability.
Ordinary combinatorial control logic is difficult
to check without duplication. Using a
microsequencer is one method of building control
logic that is easily checked. For example, all
the microcontrol stores are parity checked. The
M Box also checks the parity of the address, 
stack underflow and overflow, and stack 
address parity. Microparity errors are recover- 
able in the E Box, F Box, and I Box. These faults 
are not recoverable in the M Box since its state 
is modified in an unrecoverable manner before 
the parity computation is complete. 



Summary 

The task of making large machines reliable 
requires a continuous effort during all phases 
of the project, from conceptual design to manu- 
facturing. In the future, machines will continue 
to get larger. Unless some major technology
breakthrough significantly changes the
reliability of components, as occurred
when transistors replaced tubes, the fault-handling
capability designed into large systems
must be improved. This improvement is needed 
to overcome the inherently higher failure rate 
that comes with having more components. 
Based on this conclusion, we created many 
design processes, manufacturing processes, and 
fault handling features that increased the relia- 
bility of the VAX 8600 system. Careful monitor- 
ing and simulation were required to ensure that 
true gains in reliability were actually achieved. 

Fault Checkers in the VAX 8600 System

In the E Box

ALU Output Parity Check
Shifter Parity Check
Microcode Parity Check per Board
Other RAM Store Check with Separate Error Flags
AMUX Parity Check
BMUX Parity Check
GPR Copy Write Recovery
Instruction Retry
Diagnostic Fault Insertion

In the M Box

Memory Address Parity Check
ECC on Cache and MOS Memory Data
Writeback on SBE
Microword Parity Check
Microaddress Parity Check
Microstack Parity Check
Microstack Underflow/Overflow Detect
A Bus Parity Check
Array Bus Parity Check
Corrupted Data Tag
CPR Parity Check

In the F Box

FBM Microword Parity Check
FBA Microword Parity Check
FDRAM Parity Check
GPRs Parity Check
Self-test (when not executing instructions)

In the I Box

Microword Parity Check
Ibuffer Parity Check
DRAM Parity Check
GPR Parity Check
OP Bus Parity Check
W Bus Parity Check
IMD Parity Check